Preface to the First Edition


Within the framework of the classical linear model it is a fairly straight- 
forward matter to establish the properties of the ordinary least squares 
(OLS) and generalized least squares (GLS) estimators for samples of any 
size. Although the classical linear model is an excellent framework for de- 
veloping a feel for the statistical techniques of estimation and inference 
that are central to econometrics, it is not particularly well adapted to the 
study of economic phenomena, because economists usually cannot conduct 
controlled experiments. Instead, the data usually exist as the outcome of 
a stochastic process outside the control of the investigator. For this rea- 
son, both the dependent and the explanatory variables may be stochastic, 
and equation disturbances may exhibit nonnormality or heteroskedasticity 
and serial correlation of unknown form, so that the classical assumptions 
are violated. Over the years a variety of useful techniques has evolved to 
deal with these difficulties. Many of these amount to straightforward mod- 
ifications or extensions of the OLS techniques (e.g., the Cochrane-Orcutt 
technique, two-stage least squares, and three-stage least squares). However, 
the finite sample properties of these statistics are rarely easy to establish 
outside of somewhat limited special cases. Instead, their usefulness is jus- 
tified primarily on the basis of their properties in large samples, because 
these properties can be fairly easily established using the powerful tools 
provided by laws of large numbers and central limit theory. 

Despite the importance of large sample theory, it has usually received 
fairly cursory treatment in even the best econometrics textbooks. This is
really no fault of the textbooks, however, because the field of asymptotic
theory has been developing rapidly. It is only recently that econometricians 
have discovered or established methods for treating adequately and com- 
prehensively the many different techniques available for dealing with the 
difficulties posed by economic data. 

This book is intended to provide a somewhat more comprehensive and 
unified treatment of large sample theory than has been available previ- 
ously and to relate the fundamental tools of asymptotic theory directly to 
many of the estimators of interest to econometricians. In addition, because 
economic data are generated in a variety of different contexts (time series, 
cross sections, time series-cross sections), we pay particular attention to the 
similarities and differences in the techniques appropriate to each of these 
contexts. 

That it is possible to present our results in a fairly unified manner high- 
lights the similarities among a variety of different techniques. It also allows 
us in specific instances to establish results that are somewhat more gen- 
eral than those previously available. We thus include some new results in 
addition to those that are better known. 

This book is intended for use both as a reference and as a textbook for 
graduate students taking courses in econometrics beyond the introductory 
level. It is therefore assumed that the reader is familiar with the basic 
concepts of probability and statistics as well as with calculus and linear 
algebra and that the reader also has a good understanding of the classical 
linear model. 

Because our goal here is to deal primarily with asymptotic theory, we 
do not consider in detail the meaning and scope of econometric models per 
se. Therefore, the material in this book can be usefully supplemented by 
standard econometrics texts, particularly any of those listed at the end of 
Chapter 1. 

I would like to express my appreciation to all those who have helped in the 
evolution of this work. In particular, I would like to thank Charles Bates, 
Ian Domowitz, Rob Engle, Clive Granger, Lars Hansen, David Hendry, 
and Murray Rosenblatt. Particular thanks are due Jeff Wooldridge for his 
work in producing the solution set for the exercises. I also thank the stu- 
dents in various graduate classes at UCSD, who have served as unwitting 
and indispensable guinea pigs in the development of this material. I am 
deeply grateful to Annetta Whiteman, who typed this difficult manuscript 
with incredible swiftness and accuracy. Finally, I would like to thank the 
National Science Foundation for providing financial support for this work 
under grant SES81-07552. 


Preface to the Revised Edition 


It is a gratifying experience to be asked to revise and update a book written 
over fifteen years previously. Certainly, this request would be unnecessary 
had the book not exhibited an unusual tenacity in serving its purpose. Such 
tenacity had been my fond hope for this book, and it is always gratifying 
to see fond hopes realized. 

It is also humbling and occasionally embarrassing to perform such a 
revision. Certain errors and omissions become painfully obvious. Thoughts 
of “How could I have thought that?” or “How could I have done that?” 
arise with regularity. Nevertheless, the opportunity is at hand to put things 
right, and it is satisfying to believe that one has succeeded in this. (I know, 
of course, that errors still lurk, but I hope that this time they are more 
benign or buried more deeply, or preferably both.) 

Thus, the reader of this edition will find numerous instances where defini- 
tions have been corrected or clarified and where statements of results have 
been corrected or made more precise or complete. The exposition, too, has 
been polished in the hope of aiding clarity. 

Not only is a revision of this sort an opportunity to fix prior shortcom- 
ings, but it is also an opportunity to bring the material covered up-to-date. 
In retrospect, the first edition of this book was more ambitious than origi- 
nally intended. The fundamental research necessary to achieve the intended 
scope and cohesiveness of the overall vision for the work was by no means 
complete at the time the first edition was written. For example, the central 
limit theory for heterogeneous mixing processes had still not developed to
the desired point at that time, nor had the theories of optimal instrumental
variables estimation or asymptotic covariance estimation. 

Indeed, the attempt made in writing the first edition to achieve its in- 
tended scope and coherence revealed a host of areas where work was needed, 
thus providing fuel for a great deal of my own research and (I like to think) 
that of others. In the years intervening, the efforts of the econometrics re- 
search community have succeeded wonderfully in delivering results in the 
areas needed and much more. Thus, the ambitions not realized in the first 
edition can now be achieved. If the theoretical vision presented here has 
not achieved a much better degree of unity, it can no longer be attributed 
to a lack of development of the field, but is now clearly identifiable as the 
author’s own responsibility. 

As a result of these developments, the reader of this second edition will 
now find much updated material, particularly with regard to central limit 
theory, asymptotically efficient instrumental variables estimation, and esti- 
mation of asymptotic covariance matrices. In particular, the original Chap- 
ter 7 (concerning efficient estimation with estimated error covariance ma- 
trices) and an entire section of Chapter 4 concerning efficient IV estimation 
have been removed and replaced with much more accessible and coherent 
results on efficient IV estimation, now appearing in Chapter 4. 

There is also the progress of the field to contend with. When the first 
edition was written, cointegration was a subject in its infancy, and the 
tools needed to study the asymptotic behavior of estimators for models of 
cointegrated processes were years away from fruition. Indeed, results of De- 
Jong and Davidson (2000) essential to placing estimation for cointegrated 
processes cohesively in place with the theory contained in the first six chap- 
ters of this book became available only months before work on this edition 
began. 

Consequently, this second edition contains a completely new Chapter 7 
devoted to functional central limit theory and its applications, specifically 
unit root regression, spurious regression, and regression with cointegrated 
processes. Given the explosive growth in this area, we cannot here achieve 
a broad treatment of cointegration. Nevertheless, in the new Chapter 7 
the reader should find all the basic tools necessary for entrée into this 
fascinating area. 

The comments, suggestions, and influence of numerous colleagues over 
the years have had effects both subtle and patent on the material pre- 
sented here. With sincere apologies to anyone inadvertently omitted, I ac- 
knowledge with keen appreciation the direct and indirect contributions to 
the present state of this book by Takeshi Amemiya, Donald W. K. Andrews,
Charles Bates, Herman Bierens, James Davidson, Robert DeJong,
Ian Domowitz, Graham Elliott, Robert Engle, A. Ronald Gallant, Arthur
Goldberger, Clive W. J. Granger, James Hamilton, Bruce Hansen, Lars
Hansen, Jerry Hausman, David Hendry, Søren Johansen, Edward Leamer,
James MacKinnon, Whitney Newey, Peter C. B. Phillips, Eugene Savin,
Chris Sims, Maxwell Stinchcombe, James Stock, Mark Watson, Kenneth 
West, and Jeffrey Wooldridge. Special thanks are due Mark Salmon, who 
originally suggested writing this book. UCSD graduate students who helped 
with the revision include Jin Seo Cho, Raffaella Giacomini, Andrew Pat- 
ton, Sivan Ritz, Kevin Sheppard, Liangjun Su, and Nada Wasi. I also thank 
sincerely Peter Reinhard Hansen, who has assisted invaluably with the cre- 
ation of this revised edition, acting as electronic amanuensis and editor, 
and who is responsible for preparation of the revised set of solutions to the 
exercises. Finally, I thank Michael J. Bacci for his invaluable logistical sup- 
port and the National Science Foundation for providing financial support 
under grant SBR-9811562. 


Del Mar, CA 
July, 2000


CHAPTER 1


The Linear Model and Instrumental 
Variables Estimators 


The purpose of this book is to provide the reader with the tools and con- 
cepts needed to study the behavior of econometric estimators and test 
statistics in large samples. Throughout, attention will be directed to esti- 
mation and inference in the framework of a linear stochastic relationship 
such as 


$$Y_t = \mathbf{X}_t'\boldsymbol{\beta}_o + \varepsilon_t, \qquad t = 1, \ldots, n,$$


where we have $n$ observations on the scalar dependent variable $Y_t$ and
the vector of explanatory variables
$\mathbf{X}_t = (X_{t1}, X_{t2}, \ldots, X_{tk})'$. The scalar stochastic
disturbance $\varepsilon_t$ is unobserved, and $\boldsymbol{\beta}_o$ is an
unknown $k \times 1$ vector of coefficients that we are interested in
learning about, either through estimation or through hypothesis testing.
In matrix notation this relationship is written as


$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}_o + \boldsymbol{\varepsilon},$$


where $\mathbf{Y}$ is an $n \times 1$ vector, $\mathbf{X}$ is an
$n \times k$ matrix with rows $\mathbf{X}_t'$, and
$\boldsymbol{\varepsilon}$ is an $n \times 1$ vector with elements
$\varepsilon_t$.

(Our notation embodies a convention we follow throughout: scalars will 
be represented in standard type, while vectors and matrices will be repre- 
sented in boldface. Throughout, all vectors are column vectors.) 

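Most econometric estimators can be viewed as solutions to an optimization
problem. For example, the ordinary least squares estimator is the value
for $\boldsymbol{\beta}$ that minimizes the sum of squared residuals

$$\mathrm{SSR}(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = \sum_{t=1}^{n} (Y_t - \mathbf{X}_t'\boldsymbol{\beta})^2.$$

The first-order conditions for a minimum are

$$\partial\,\mathrm{SSR}(\boldsymbol{\beta})/\partial\boldsymbol{\beta} = -2\mathbf{X}'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = -2\sum_{t=1}^{n} \mathbf{X}_t (Y_t - \mathbf{X}_t'\boldsymbol{\beta}) = \mathbf{0}.$$

If $\mathbf{X}'\mathbf{X} = \sum_{t=1}^{n} \mathbf{X}_t\mathbf{X}_t'$ is
nonsingular, this system of $k$ equations in $k$ unknowns can be uniquely
solved for the ordinary least squares (OLS) estimator

$$\hat{\boldsymbol{\beta}}_n = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \left(\sum_{t=1}^{n} \mathbf{X}_t\mathbf{X}_t'\right)^{-1} \sum_{t=1}^{n} \mathbf{X}_t Y_t.$$

Our interest centers on the behavior of estimators such as
$\hat{\boldsymbol{\beta}}_n$ as $n$ grows larger and larger. We seek
conditions that will allow us to draw conclusions about the behavior of
$\hat{\boldsymbol{\beta}}_n$; for example, that $\hat{\boldsymbol{\beta}}_n$
has a particular distribution or certain first and second moments.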
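
As a numerical illustration of the formulas above, the following Python
sketch computes the OLS estimator in both its matrix form and its sum form
and checks the first-order conditions. It assumes NumPy is available; the
sample size, random seed, and coefficient values are hypothetical choices
made only for the illustration, and the helper name `ols` is ours.

```python
import numpy as np

def ols(X, Y):
    """Solve the normal equations X'X b = X'Y for the OLS estimator."""
    XtX = X.T @ X
    XtY = X.T @ Y
    # np.linalg.solve raises LinAlgError when X'X is singular
    return np.linalg.solve(XtX, XtY)

# Hypothetical simulated data: n = 200 observations, k = 3 regressors.
rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
beta_o = np.array([1.0, -2.0, 0.5])   # illustrative "true" coefficients
Y = X @ beta_o + rng.normal(size=n)

beta_hat = ols(X, Y)

# Sum form of the same estimator: (sum_t X_t X_t')^{-1} sum_t X_t Y_t
S_xx = sum(np.outer(X[t], X[t]) for t in range(n))
S_xy = sum(X[t] * Y[t] for t in range(n))
beta_hat_sum = np.linalg.solve(S_xx, S_xy)

# First-order conditions: X'(Y - X beta_hat) should be numerically zero
foc = X.T @ (Y - X @ beta_hat)
```

Solving the normal equations with `np.linalg.solve` rather than forming
the inverse explicitly is a common numerical choice; it does not change
the estimator.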

The assumptions of the classical linear model allow us to draw such 
conclusions for any n. These conditions and results can be formally stated 
as the following theorem. 


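Theorem 1.1 The following are the assumptions of the classical linear
model.

(i) The data are generated as $Y_t = \mathbf{X}_t'\boldsymbol{\beta}_o + \varepsilon_t$, $t = 1, \ldots, n$, $\boldsymbol{\beta}_o \in \mathbb{R}^k$.
(ii) $\mathbf{X}$ is a nonstochastic and finite $n \times k$ matrix, $n > k$.
(iii) $\mathbf{X}'\mathbf{X}$ is nonsingular.
(iv) $E(\boldsymbol{\varepsilon}) = \mathbf{0}$.
(v) $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma_o^2\mathbf{I})$, $\sigma_o^2 < \infty$.

(a) (Existence) Given (i)-(iii), $\hat{\boldsymbol{\beta}}_n$ exists and is unique.
(b) (Unbiasedness) Given (i)-(iv), $E(\hat{\boldsymbol{\beta}}_n) = \boldsymbol{\beta}_o$.
(c) (Normality) Given (i)-(v), $\hat{\boldsymbol{\beta}}_n \sim N(\boldsymbol{\beta}_o, \sigma_o^2(\mathbf{X}'\mathbf{X})^{-1})$.
(d) (Efficiency) Given (i)-(v), $\hat{\boldsymbol{\beta}}_n$ is the maximum likelihood estimator
and is the best unbiased estimator in the sense that the variance-covariance
matrix of any other unbiased estimator exceeds that of $\hat{\boldsymbol{\beta}}_n$
by a positive semidefinite matrix, regardless of the value of $\boldsymbol{\beta}_o$.


Proof. See Theil (1971, Ch. 3). ∎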


In the statement of the assumptions above, $E(\cdot)$ denotes the expected
value operator, and $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma_o^2\mathbf{I})$
means that $\boldsymbol{\varepsilon}$ is distributed as ($\sim$) multivariate
normal with mean vector zero and covariance matrix $\sigma_o^2\mathbf{I}$,
where $\mathbf{I}$ is the identity matrix.

The properties of existence, unbiasedness, normality, and efficiency of
an estimator are the small sample analogs of the properties that will be
the focus of interest here. Unbiasedness tells us that the distribution of
$\hat{\boldsymbol{\beta}}_n$ is centered around the unknown true value
$\boldsymbol{\beta}_o$, whereas the normality property allows us to
construct confidence intervals and test hypotheses using the t- or
F-distributions (see Theil, 1971, pp. 130-146). The efficiency property
guarantees that our estimator has the greatest possible precision within a
given class of estimators and also helps ensure that tests of hypotheses
have high power.

Of course, the classical assumptions are rather stringent and can easily 
fail in situations faced by economists. Since failures of assumptions (iii) 
and (iv) are easily remedied (exclude linearly dependent regressors if (iii) 
fails; include a constant in the model if (iv) fails), we will concern ourselves 
primarily with the failure of assumptions (ii) and (v). The possible failure 
of assumption (i) is a subject that requires a book in itself (see, e.g., White, 
1994) and will not be considered here. Nevertheless, the tools developed in 
this book will be essential to understanding and treating the consequences 
of the failure of assumption (i). 

Let us briefly examine the consequences of various failures of assumptions
(ii) or (v). First, suppose that $\boldsymbol{\varepsilon}$ exhibits
heteroskedasticity or serial correlation, so that
$E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') = \boldsymbol{\Omega} \neq \sigma_o^2\mathbf{I}$.
We have the following result for the OLS estimator.


Theorem 1.2 Suppose the classical assumptions (i)-(iv) hold but replace
(v) with (v') $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \boldsymbol{\Omega})$,
$\boldsymbol{\Omega}$ finite and nonsingular. Then (a) and (b) hold
as before, (c) is replaced by

(c') (Normality) Given (i)-(v'),

$$\hat{\boldsymbol{\beta}}_n \sim N\big(\boldsymbol{\beta}_o, (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\big),$$

and (d) does not hold; that is, $\hat{\boldsymbol{\beta}}_n$ is no longer
necessarily the best unbiased estimator.


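Proof. By definition, $\hat{\boldsymbol{\beta}}_n = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$. Given (i),

$$\hat{\boldsymbol{\beta}}_n = \boldsymbol{\beta}_o + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon},$$

where $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}$ is a
linear combination of jointly normal random variables and is therefore
jointly normal with

$$E\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\big) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\boldsymbol{\varepsilon}) = \mathbf{0},$$

given (ii) and (iv) and

$$\begin{aligned}
\mathrm{var}\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\big)
&= E\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\big) \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}')\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1},
\end{aligned}$$

given (ii) and (v'). Hence
$\hat{\boldsymbol{\beta}}_n \sim N(\boldsymbol{\beta}_o, (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1})$.
That (d) does not hold follows because there exists an unbiased estimator
with a smaller covariance matrix than $\hat{\boldsymbol{\beta}}_n$, namely,
$\hat{\boldsymbol{\beta}}_n^* = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{Y}$.
We examine its properties next. ∎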
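
The covariance comparison invoked at the end of the proof can be checked
numerically in a particular case. The sketch below, which assumes NumPy and
uses a hypothetical diagonal (heteroskedastic) covariance matrix chosen only
for illustration, forms the OLS covariance matrix from (c') and the
covariance matrix $(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}$ of
the competing estimator, and computes the eigenvalues of their difference.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
X = rng.normal(size=(n, k))

# Hypothetical heteroskedastic covariance: diagonal Omega, unequal variances.
omega_diag = rng.uniform(0.5, 4.0, size=n)
Omega = np.diag(omega_diag)

XtX_inv = np.linalg.inv(X.T @ X)
# Covariance of OLS under (v'): (X'X)^{-1} X' Omega X (X'X)^{-1}
V_ols = XtX_inv @ X.T @ Omega @ X @ XtX_inv
# Covariance of the competing estimator: (X' Omega^{-1} X)^{-1}
V_gls = np.linalg.inv(X.T @ np.linalg.inv(Omega) @ X)

# V_ols - V_gls should be positive semidefinite:
# all eigenvalues nonnegative up to rounding error.
eigvals = np.linalg.eigvalsh(V_ols - V_gls)
```

A single design such as this one illustrates the ordering but does not
prove it.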


As long as $\boldsymbol{\Omega}$ is known, the presence of serial
correlation or heteroskedasticity does not cause problems for testing
hypotheses or constructing confidence intervals. This can still be done
using (c'), although the failure of (d) indicates that the OLS estimator
may not be best for these purposes. However, if $\boldsymbol{\Omega}$ is
unknown (apart from a factor of proportionality), testing hypotheses and
constructing confidence intervals is no longer a simple matter. One might
be able to construct tests based on estimates of $\boldsymbol{\Omega}$,
but the resulting statistics may have very complicated distributions. As
we shall see in Chapter 6, this difficulty is lessened in large samples by
the availability of convenient approximations based on the central limit
theorem and laws of large numbers.

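If $\boldsymbol{\Omega}$ is known, efficiency can be regained by applying
OLS to a linear transformation of the original model, i.e.,

$$\mathbf{C}^{-1}\mathbf{Y} = \mathbf{C}^{-1}\mathbf{X}\boldsymbol{\beta}_o + \mathbf{C}^{-1}\boldsymbol{\varepsilon}$$

or

$$\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}_o + \boldsymbol{\varepsilon}^*,$$

where $\mathbf{Y}^* = \mathbf{C}^{-1}\mathbf{Y}$,
$\mathbf{X}^* = \mathbf{C}^{-1}\mathbf{X}$,
$\boldsymbol{\varepsilon}^* = \mathbf{C}^{-1}\boldsymbol{\varepsilon}$, and
$\mathbf{C}$ is a nonsingular factorization of $\boldsymbol{\Omega}$ such
that $\mathbf{C}\mathbf{C}' = \boldsymbol{\Omega}$ so that
$\mathbf{C}^{-1}\boldsymbol{\Omega}\mathbf{C}^{-1\prime} = \mathbf{I}$. This
transformation ensures that
$E(\boldsymbol{\varepsilon}^*\boldsymbol{\varepsilon}^{*\prime}) = E(\mathbf{C}^{-1}\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{C}^{-1\prime}) = \mathbf{C}^{-1}E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}')\mathbf{C}^{-1\prime} = \mathbf{C}^{-1}\boldsymbol{\Omega}\mathbf{C}^{-1\prime} = \mathbf{I}$,
so that assumption (v) once again holds. The least squares estimator for
the transformed model is

$$\begin{aligned}
\hat{\boldsymbol{\beta}}_n^*
&= (\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{Y}^* \\
&= (\mathbf{X}'\mathbf{C}^{-1\prime}\mathbf{C}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{C}^{-1\prime}\mathbf{C}^{-1}\mathbf{Y} \\
&= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{Y}.
\end{aligned}$$

The estimator $\hat{\boldsymbol{\beta}}_n^*$ is called the generalized
least squares (GLS) estimator, and its properties are given by the
following result.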


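Theorem 1.3 The following are the “generalized” classical assumptions.

(i) The data are generated as $Y_t = \mathbf{X}_t'\boldsymbol{\beta}_o + \varepsilon_t$, $t = 1, \ldots, n$, $\boldsymbol{\beta}_o \in \mathbb{R}^k$.
(ii) $\mathbf{X}$ is a nonstochastic and finite $n \times k$ matrix, $n > k$.
(iii*) $\boldsymbol{\Omega}$ is finite and positive definite, and $\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X}$ is nonsingular.
(iv) $E(\boldsymbol{\varepsilon}) = \mathbf{0}$.
(v*) $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \boldsymbol{\Omega})$.

(a) (Existence) Given (i)-(iii*), $\hat{\boldsymbol{\beta}}_n^*$ exists and is unique.
(b) (Unbiasedness) Given (i)-(iv), $E(\hat{\boldsymbol{\beta}}_n^*) = \boldsymbol{\beta}_o$.
(c) (Normality) Given (i)-(v*), $\hat{\boldsymbol{\beta}}_n^* \sim N(\boldsymbol{\beta}_o, (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1})$.
(d) (Efficiency) Given (i)-(v*), $\hat{\boldsymbol{\beta}}_n^*$ is the maximum likelihood estimator
and is the best unbiased estimator.


Proof. Apply Theorem 1.1 with $Y_t^* = \mathbf{X}_t^{*\prime}\boldsymbol{\beta}_o + \varepsilon_t^*$. ∎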
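
The transformation argument can be illustrated numerically. The following
sketch assumes NumPy, takes the factorization $\mathbf{C}$ to be the
Cholesky factor of $\boldsymbol{\Omega}$ (one of many possible nonsingular
factorizations), and uses a hypothetical covariance matrix with entries
$\rho^{|s-t|}$ chosen only for illustration. It checks that OLS applied to
the transformed data agrees with the direct GLS formula.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 2
X = rng.normal(size=(n, k))
beta_o = np.array([2.0, -1.0])          # illustrative coefficients

# Hypothetical covariance matrix: Omega[s, t] = rho^{|s - t|}
rho = 0.6
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

C = np.linalg.cholesky(Omega)           # lower triangular, C C' = Omega
eps = C @ rng.normal(size=n)            # disturbances with covariance Omega
Y = X @ beta_o + eps

# Transformed model: Y* = C^{-1} Y, X* = C^{-1} X, then OLS
Y_star = np.linalg.solve(C, Y)
X_star = np.linalg.solve(C, X)
beta_transformed = np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)

# Direct formula: (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y
Omega_inv = np.linalg.inv(Omega)
beta_direct = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ Y)
```

Any other nonsingular $\mathbf{C}$ with $\mathbf{C}\mathbf{C}' = \boldsymbol{\Omega}$
would give the same estimator, since only $\boldsymbol{\Omega}^{-1}$ enters
the final formula.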

If $\boldsymbol{\Omega}$ is known, we obtain efficiency by transforming the
model “back” to a form in which OLS gives the efficient estimator. However,
if $\boldsymbol{\Omega}$ is unknown, this transformation is not immediately
available. It might be possible to estimate $\boldsymbol{\Omega}$, say by
$\hat{\boldsymbol{\Omega}}$, but $\hat{\boldsymbol{\Omega}}$ is then random
and so is the factorization $\hat{\mathbf{C}}$. Theorem 1.1 no longer
applies. Nevertheless, it turns out that in large samples we can often
proceed by replacing $\boldsymbol{\Omega}$ with a suitable estimator
$\hat{\boldsymbol{\Omega}}$. We consider such situations in Chapter 4.

Hypothesis testing in the classical linear model relies heavily on being
able to make use of the t- and F-distributions. However, it is quite possible
that the normality assumption (v) or (v*) may fail. When this happens,
the classical t- and F-statistics generally no longer have the t- and
F-distributions. Nevertheless, the central limit theorem can be applied when
$n$ is large to guarantee that $\hat{\boldsymbol{\beta}}_n$ or
$\hat{\boldsymbol{\beta}}_n^*$ is distributed approximately as normal,
as we shall see in Chapters 4 and 5.

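Now consider what happens when assumption (ii) fails, so that the
explanatory variables $\mathbf{X}_t$ are stochastic. In some cases, this
causes no real problems because we can examine the properties of our
estimators “conditional” on $\mathbf{X}$. For example, consider the
unbiasedness property. To demonstrate unbiasedness we use (i) to write

$$\hat{\boldsymbol{\beta}}_n = \boldsymbol{\beta}_o + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}.$$

If $\mathbf{X}$ is random, we can no longer write
$E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\boldsymbol{\varepsilon})$.
However, by taking conditional expectations, we can treat $\mathbf{X}$ as
“fixed,” so we have

$$\begin{aligned}
E(\hat{\boldsymbol{\beta}}_n \mid \mathbf{X})
&= \boldsymbol{\beta}_o + E\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon} \mid \mathbf{X}\big) \\
&= \boldsymbol{\beta}_o + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\boldsymbol{\varepsilon} \mid \mathbf{X}).
\end{aligned}$$

If we are willing to assume $E(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{0}$,
then conditional unbiasedness follows, i.e.,

$$E(\hat{\boldsymbol{\beta}}_n \mid \mathbf{X}) = \boldsymbol{\beta}_o.$$

Unconditional unbiasedness follows from this as a consequence of the law
of iterated expectations (given in Chapter 3), i.e.,

$$E(\hat{\boldsymbol{\beta}}_n) = E[E(\hat{\boldsymbol{\beta}}_n \mid \mathbf{X})] = E(\boldsymbol{\beta}_o) = \boldsymbol{\beta}_o.$$

The other properties can be similarly considered. However, the assumption
that $E(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{0}$ is crucial.
If $E(\boldsymbol{\varepsilon} \mid \mathbf{X}) \neq \mathbf{0}$,
$\hat{\boldsymbol{\beta}}_n$ need not be unbiased, either conditionally or
unconditionally.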

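Situations in which $E(\boldsymbol{\varepsilon} \mid \mathbf{X}) \neq \mathbf{0}$
can arise easily in economics. For example, $\mathbf{X}_t$ may contain
errors of measurement. Suppose the data are generated as

$$Y_t = \mathbf{W}_t'\boldsymbol{\beta}_o + v_t, \qquad E(\mathbf{W}_t v_t) = \mathbf{0},$$

but we measure $\mathbf{W}_t$ subject to errors $\boldsymbol{\eta}_t$, as
$\mathbf{X}_t = \mathbf{W}_t + \boldsymbol{\eta}_t$,
$E(\mathbf{W}_t\boldsymbol{\eta}_t') = \mathbf{0}$,
$E(\boldsymbol{\eta}_t\boldsymbol{\eta}_t') \neq \mathbf{0}$,
$E(\boldsymbol{\eta}_t v_t) = \mathbf{0}$. Then

$$Y_t = \mathbf{X}_t'\boldsymbol{\beta}_o + v_t - \boldsymbol{\eta}_t'\boldsymbol{\beta}_o = \mathbf{X}_t'\boldsymbol{\beta}_o + \varepsilon_t.$$

With $\varepsilon_t = v_t - \boldsymbol{\eta}_t'\boldsymbol{\beta}_o$, we have
$E(\mathbf{X}_t\varepsilon_t) = E[(\mathbf{W}_t + \boldsymbol{\eta}_t)(v_t - \boldsymbol{\eta}_t'\boldsymbol{\beta}_o)] = -E(\boldsymbol{\eta}_t\boldsymbol{\eta}_t')\boldsymbol{\beta}_o \neq \mathbf{0}$.
Now $E(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{0}$ implies that
for all $t$, $E(\mathbf{X}_t\varepsilon_t) = \mathbf{0}$, since
$E(\mathbf{X}_t\varepsilon_t) = E[E(\mathbf{X}_t\varepsilon_t \mid \mathbf{X})] = E[\mathbf{X}_t E(\varepsilon_t \mid \mathbf{X})] = \mathbf{0}$.
Hence $E(\mathbf{X}_t\varepsilon_t) \neq \mathbf{0}$ implies
$E(\boldsymbol{\varepsilon} \mid \mathbf{X}) \neq \mathbf{0}$. The OLS
estimator will not be unbiased in the presence of measurement errors.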
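
A small simulation can make the measurement error example concrete. The
sketch below assumes NumPy and a hypothetical scalar design in which
$W_t$, $v_t$, and $\eta_t$ are independent standard normal variables and
$\beta_o = 2$; these choices are ours, made only for illustration. In this
design $-E(\eta_t^2)\beta_o = -2$, and OLS on the mismeasured regressor
settles near $\beta_o E(W_t^2)/(E(W_t^2) + E(\eta_t^2)) = 1$ rather than 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta_o = 2.0                       # illustrative scalar coefficient

W = rng.normal(size=n)             # true regressor, variance 1
v = rng.normal(size=n)             # structural error, uncorrelated with W
eta = rng.normal(size=n)           # measurement error, variance 1

Y = W * beta_o + v
X = W + eta                        # what we actually observe
eps = v - eta * beta_o             # implied disturbance in Y = X beta_o + eps

moment = np.mean(X * eps)          # sample analog of E(X_t eps_t)
b_ols = (X @ Y) / (X @ X)          # OLS of Y on the mismeasured X
b_infeasible = (W @ Y) / (W @ W)   # OLS on the true regressor, for comparison
```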

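As another example, consider the data generating process

$$Y_t = Y_{t-1}\alpha_o + \mathbf{W}_t'\boldsymbol{\delta}_o + \varepsilon_t, \qquad E(\mathbf{W}_t\varepsilon_t) = \mathbf{0};$$
$$\varepsilon_t = \rho_o\varepsilon_{t-1} + v_t, \qquad E(\varepsilon_{t-1}v_t) = 0.$$

This is the case of serially correlated errors in the presence of a lagged
dependent variable $Y_{t-1}$. Let $\mathbf{X}_t = (Y_{t-1}, \mathbf{W}_t')'$
and $\boldsymbol{\beta}_o = (\alpha_o, \boldsymbol{\delta}_o')'$. Again,
we have

$$Y_t = \mathbf{X}_t'\boldsymbol{\beta}_o + \varepsilon_t,$$

but

$$E(\mathbf{X}_t\varepsilon_t) = E((Y_{t-1}, \mathbf{W}_t')'\varepsilon_t) = (E(Y_{t-1}\varepsilon_t), \mathbf{0})'.$$

If we also assume $E(Y_{t-1}v_t) = 0$,
$E(Y_{t-1}\varepsilon_{t-1}) = E(Y_t\varepsilon_t)$, and
$E(\varepsilon_t^2) = \sigma_o^2$, it can be shown that

$$E(Y_{t-1}\varepsilon_t) = \frac{\sigma_o^2\rho_o}{1 - \rho_o\alpha_o}.$$

Thus if $\rho_o \neq 0$, then $E(\mathbf{X}_t\varepsilon_t) \neq \mathbf{0}$
so that $E(\boldsymbol{\varepsilon} \mid \mathbf{X}) \neq \mathbf{0}$ and
OLS is not generally unbiased.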
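
The expression for $E(Y_{t-1}\varepsilon_t)$ follows in two steps from the
stated assumptions. The derivation below is a sketch that uses only those
assumptions, together with the additional requirement
$\rho_o\alpha_o \neq 1$ so that the final division is permitted.

```latex
\begin{align*}
E(Y_{t-1}\varepsilon_t)
  &= \rho_o E(Y_{t-1}\varepsilon_{t-1}) + E(Y_{t-1}v_t)
   = \rho_o E(Y_t\varepsilon_t), \\
E(Y_t\varepsilon_t)
  &= \alpha_o E(Y_{t-1}\varepsilon_t)
     + \boldsymbol{\delta}_o' E(\mathbf{W}_t\varepsilon_t)
     + E(\varepsilon_t^2)
   = \alpha_o E(Y_{t-1}\varepsilon_t) + \sigma_o^2 .
\end{align*}
% Substituting the second line into the first:
\begin{align*}
E(Y_{t-1}\varepsilon_t)\,(1 - \rho_o\alpha_o) = \rho_o\sigma_o^2
\quad\Longrightarrow\quad
E(Y_{t-1}\varepsilon_t) = \frac{\sigma_o^2\rho_o}{1 - \rho_o\alpha_o}.
\end{align*}
```
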
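As a final example, consider a system of simultaneous equations

$$Y_{t1} = Y_{t2}\alpha_o + \mathbf{W}_{t1}'\boldsymbol{\delta}_o + \varepsilon_{t1}, \qquad E(\mathbf{W}_{t1}\varepsilon_{t1}) = \mathbf{0},$$
$$Y_{t2} = \mathbf{W}_{t2}'\boldsymbol{\gamma}_o + \varepsilon_{t2}, \qquad E(\mathbf{W}_{t2}\varepsilon_{t2}) = \mathbf{0}.$$

Suppose we are only interested in the first equation, but we know
$E(\varepsilon_{t1}\varepsilon_{t2}) = \sigma_{12} \neq 0$. Let
$\mathbf{X}_{t1} = (Y_{t2}, \mathbf{W}_{t1}')'$ and
$\boldsymbol{\beta}_o = (\alpha_o, \boldsymbol{\delta}_o')'$. The equation
of interest is now

$$Y_{t1} = \mathbf{X}_{t1}'\boldsymbol{\beta}_o + \varepsilon_{t1}.$$

In this case
$E(\mathbf{X}_{t1}\varepsilon_{t1}) = E((Y_{t2}, \mathbf{W}_{t1}')'\varepsilon_{t1}) = (E(Y_{t2}\varepsilon_{t1}), \mathbf{0})'$.
Now

$$E(Y_{t2}\varepsilon_{t1}) = E((\mathbf{W}_{t2}'\boldsymbol{\gamma}_o + \varepsilon_{t2})\varepsilon_{t1}) = E(\varepsilon_{t1}\varepsilon_{t2}) = \sigma_{12} \neq 0,$$

assuming $E(\mathbf{W}_{t2}\varepsilon_{t1}) = \mathbf{0}$. Thus
$E(\mathbf{X}_{t1}\varepsilon_{t1}) = (\sigma_{12}, \mathbf{0})' \neq \mathbf{0}$,
so again OLS is not generally unbiased, either conditionally or
unconditionally.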

Not only is the OLS estimator generally biased in these circumstances,
but it can be shown under reasonable conditions that this bias does not
get smaller as $n$ gets larger. Fortunately, there is an alternative to
least squares that is better behaved, at least in large samples. This
alternative, first used by P. G. Wright (1928) and his son S. Wright (1925)
and formally developed by Reiersøl (1941, 1945) and Geary (1949), exploits
the fact that even when $E(\mathbf{X}_t\varepsilon_t) \neq \mathbf{0}$, it
is often possible to use economic theory to find other variables that are
uncorrelated with the errors $\varepsilon_t$. Without such variables,
correlations between the observables and unobservables (the errors
$\varepsilon_t$) persistently contaminate our estimators, making it
impossible to learn anything about $\boldsymbol{\beta}_o$. Hence, these
variables are instrumental in allowing us to estimate
$\boldsymbol{\beta}_o$, and we shall denote these “instrumental variables”
as an $l \times 1$ vector $\mathbf{Z}_t$. The $n \times l$ matrix
$\mathbf{Z}$ has rows $\mathbf{Z}_t'$.

To be useful, the instrumental variables must also be closely enough
related to $\mathbf{X}_t$ so that $\mathbf{Z}'\mathbf{X}$ has full column
rank. If we know from economic theory that
$E(\mathbf{X}_t\varepsilon_t) = \mathbf{0}$, then $\mathbf{X}_t$ can serve
directly as the set of instrumental variables. As we saw previously,
$\mathbf{X}_t$ may be correlated with $\varepsilon_t$, so we cannot always
choose $\mathbf{Z}_t = \mathbf{X}_t$. Nevertheless, in each of those
examples, the structure of the data generating process suggests some
reasonable choices for $\mathbf{Z}$. In the case of errors of measurement,
a useful set of instrumental variables would be another set of
measurements on $\mathbf{W}_t$ subject to errors $\boldsymbol{\xi}_t$
uncorrelated with $\boldsymbol{\eta}_t$ and $v_t$, say
$\mathbf{Z}_t = \mathbf{W}_t + \boldsymbol{\xi}_t$. Then
$E(\mathbf{Z}_t\varepsilon_t) = E[(\mathbf{W}_t + \boldsymbol{\xi}_t)(v_t - \boldsymbol{\eta}_t'\boldsymbol{\beta}_o)] = \mathbf{0}$.
In the case of serial correlation in the presence of lagged dependent
variables, a useful choice is
$\mathbf{Z}_t = (\mathbf{W}_t', \mathbf{W}_{t-1}')'$ provided
$E(\mathbf{W}_{t-1}\varepsilon_t) = \mathbf{0}$, which is not unreasonable.
Note that the relation
$Y_{t-1} = Y_{t-2}\alpha_o + \mathbf{W}_{t-1}'\boldsymbol{\delta}_o + \varepsilon_{t-1}$
ensures that $\mathbf{W}_{t-1}$ will be related to $Y_{t-1}$. In the case
of simultaneous equations, a useful choice is
$\mathbf{Z}_t = (\mathbf{W}_{t1}', \mathbf{W}_{t2}')'$. The relation
$Y_{t2} = \mathbf{W}_{t2}'\boldsymbol{\gamma}_o + \varepsilon_{t2}$ ensures
that $\mathbf{W}_{t2}$ will be related to $Y_{t2}$.

In what follows, we shall simply assume that such instrumental variables 
are available. However, in Chapter 4 we shall be able to specify precisely 
how best to choose the instrumental variables. 

Earlier, we stated the important fact that most econometric estimators
can be viewed as solutions to an optimization problem. In the present
context, the zero correlation property
$E(\mathbf{Z}_t\varepsilon_t) = \mathbf{0}$ provides the fundamental
basis for estimating $\boldsymbol{\beta}_o$. Because
$\varepsilon_t = Y_t - \mathbf{X}_t'\boldsymbol{\beta}_o$,
$\boldsymbol{\beta}_o$ is a solution of the equations
$E(\mathbf{Z}_t(Y_t - \mathbf{X}_t'\boldsymbol{\beta}_o)) = \mathbf{0}$.
However, we usually do not know the expectations $E(\mathbf{Z}_t Y_t)$ and
$E(\mathbf{Z}_t\mathbf{X}_t')$ needed to find a solution to these
equations, so we replace expectations with sample averages, which we hope
will provide a close enough approximation. Thus, consider finding a
solution to the equations

$$n^{-1}\sum_{t=1}^{n} \mathbf{Z}_t(Y_t - \mathbf{X}_t'\boldsymbol{\beta}) = \mathbf{Z}'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})/n = \mathbf{0}.$$

This is a system of $l$ equations in $k$ unknowns. If $l < k$, there is a
multiplicity of solutions; if $l = k$, the unique solution is
$\tilde{\boldsymbol{\beta}}_n = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{Y}$,
provided that $\mathbf{Z}'\mathbf{X}$ is nonsingular; and if $l > k$,
these equations need have no solution, although there may be a value for
$\boldsymbol{\beta}$ that makes
$\mathbf{Z}'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$ “closest” to zero.

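This provides the basis for solving an optimization problem. Because
economic theory typically leads to situations in which $l > k$, we can
estimate $\boldsymbol{\beta}_o$ by finding that value of
$\boldsymbol{\beta}$ that minimizes the quadratic distance from zero of
$\mathbf{Z}'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$,

$$d_n(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}),$$

where $\hat{\mathbf{P}}_n$ is a symmetric $l \times l$ positive definite
norming matrix which may be stochastic. For now, $\hat{\mathbf{P}}_n$ can
be any symmetric positive definite matrix. In Chapter 4 we shall see how
the choice of $\hat{\mathbf{P}}_n$ affects the properties of our estimator
and how $\hat{\mathbf{P}}_n$ can best be chosen.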

We choose the quadratic distance measure because the minimization
problem “minimize $d_n(\boldsymbol{\beta})$ with respect to
$\boldsymbol{\beta}$” has a convenient linear solution and yields many
well-known econometric estimators. Other distance measures yield other
families of estimators that we will not consider here.

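The first-order conditions for a minimum are

$$\partial d_n(\boldsymbol{\beta})/\partial\boldsymbol{\beta} = -2\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}.$$

Provided that $\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\mathbf{X}$
is nonsingular (for which it is necessary that $\mathbf{Z}'\mathbf{X}$
have full column rank), the resulting solution is the instrumental
variables (IV) estimator (also known as the “method of moments” estimator)

$$\tilde{\boldsymbol{\beta}}_n = (\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\mathbf{Y}.$$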


All of the estimators considered in this book have this form, and by
choosing $\mathbf{Z}$ or $\hat{\mathbf{P}}_n$ appropriately, we can obtain
a large number of the estimators of interest to econometricians. For
example, with $\mathbf{Z} = \mathbf{X}$ and
$\hat{\mathbf{P}}_n = (\mathbf{X}'\mathbf{X}/n)^{-1}$,
$\tilde{\boldsymbol{\beta}}_n = \hat{\boldsymbol{\beta}}_n$; that is, the
IV estimator equals the OLS estimator. Given any $\mathbf{Z}$, choosing
$\hat{\mathbf{P}}_n = (\mathbf{Z}'\mathbf{Z}/n)^{-1}$ gives an estimator
known as two-stage least squares (2SLS). The tools developed in the
following chapters will allow us to pick $\mathbf{Z}$ and
$\hat{\mathbf{P}}_n$ in ways appropriate to many of the situations
encountered in economics.
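
The two special cases just mentioned can be verified numerically. The
following sketch assumes NumPy and simulated data with hypothetical
dimensions ($n = 300$, $k = 2$, $l = 3$); the function name
`iv_estimator` is ours. It also uses the familiar equivalence, not
derived in the text above, between 2SLS and OLS of $\mathbf{Y}$ on the
fitted values from regressing $\mathbf{X}$ on $\mathbf{Z}$.

```python
import numpy as np

def iv_estimator(Y, X, Z, P):
    """(X'Z P Z'X)^{-1} X'Z P Z'Y for a symmetric positive definite P."""
    XZ = X.T @ Z                     # k x l
    A = XZ @ P @ XZ.T                # X'Z P Z'X, k x k
    b = XZ @ P @ (Z.T @ Y)           # X'Z P Z'Y, length k
    return np.linalg.solve(A, b)

rng = np.random.default_rng(4)
n, k, l = 300, 2, 3
Z = rng.normal(size=(n, l))                       # hypothetical instruments
X = Z @ rng.normal(size=(l, k)) + rng.normal(size=(n, k))
Y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

# Special case 1: Z = X and P = (X'X/n)^{-1} reproduces OLS
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
b_iv_as_ols = iv_estimator(Y, X, X, np.linalg.inv(X.T @ X / n))

# Special case 2: P = (Z'Z/n)^{-1} gives 2SLS
b_2sls = iv_estimator(Y, X, Z, np.linalg.inv(Z.T @ Z / n))
X_fit = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # projection of X on Z
b_two_step = np.linalg.solve(X_fit.T @ X_fit, X_fit.T @ Y)
```

Note that the factor $n$ in the norming matrix cancels in the estimator, so
it affects only the scale of $d_n(\boldsymbol{\beta})$.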

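Now consider the problem of determining whether
$\tilde{\boldsymbol{\beta}}_n$ is unbiased. If the data generating process
is $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}_o + \boldsymbol{\varepsilon}$,
we have

$$\begin{aligned}
\tilde{\boldsymbol{\beta}}_n
&= (\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\mathbf{Y} \\
&= (\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'(\mathbf{X}\boldsymbol{\beta}_o + \boldsymbol{\varepsilon}) \\
&= \boldsymbol{\beta}_o + (\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\boldsymbol{\varepsilon},
\end{aligned}$$

so that

$$E(\tilde{\boldsymbol{\beta}}_n) = \boldsymbol{\beta}_o + E\big((\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\hat{\mathbf{P}}_n\mathbf{Z}'\boldsymbol{\varepsilon}\big).$$

In general, it is not possible to guarantee that the second term above
vanishes, even when $E(\boldsymbol{\varepsilon} \mid \mathbf{Z}) = \mathbf{0}$.
In fact, the expectation in the second term above may not even be defined.
For this reason, the concept of unbiasedness is not particularly relevant
to the study of IV estimators. Instead, we shall make use of the weaker
concept of consistency. Loosely speaking, an estimator is “consistent” for
$\boldsymbol{\beta}_o$ if it gets closer and closer to
$\boldsymbol{\beta}_o$ as $n$ grows. In Chapters 2 and 3 we make this
concept precise and explore the consistency properties of OLS and IV
estimators. For the examples above in which
$E(\boldsymbol{\varepsilon} \mid \mathbf{X}) \neq \mathbf{0}$, it turns out
that OLS is not consistent, whereas consistent IV estimators are available
under general conditions.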
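
The contrast between OLS and IV in large samples can be seen in a
simulation of the simultaneous equations example. The sketch below assumes
NumPy and a hypothetical scalar design (independent standard normal
$W_{t1}$, $W_{t2}$, unit error variances, $\alpha_o = 0.5$,
$\delta_o = \gamma_o = 1$, $\sigma_{12} = 0.8$); all of these values are
ours, chosen only for illustration. In this design
$E(Y_{t2}\varepsilon_{t1}) = 0.8$ and $E(Y_{t2}^2) = 2$, so the OLS
coefficient on $Y_{t2}$ settles near $0.5 + 0.8/2 = 0.9$, while the IV
estimator with $\mathbf{Z}_t = (W_{t1}, W_{t2})'$ settles near 0.5.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
alpha_o, delta_o, gamma_o = 0.5, 1.0, 1.0   # illustrative parameter values
sigma_12 = 0.8                              # E(eps_t1 eps_t2), nonzero

W1 = rng.normal(size=n)
W2 = rng.normal(size=n)
e2 = rng.normal(size=n)
# e1 has unit variance and covariance sigma_12 with e2
e1 = sigma_12 * e2 + np.sqrt(1 - sigma_12**2) * rng.normal(size=n)

Y2 = W2 * gamma_o + e2
Y1 = Y2 * alpha_o + W1 * delta_o + e1

X = np.column_stack([Y2, W1])               # regressors of the first equation
Z = np.column_stack([W1, W2])               # instruments, here l = k = 2

b_ols = np.linalg.solve(X.T @ X, X.T @ Y1)
b_iv = np.linalg.solve(Z.T @ X, Z.T @ Y1)   # (Z'X)^{-1} Z'Y
```

A single large sample only suggests the limiting behavior; the precise
statements are the subject of Chapters 2 and 3.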

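Although we only consider linear stochastic relationships in this book,
this still covers a wide range of situations. For example, suppose we have
several equations that describe demand for a group of $p$ commodities:

$$\begin{aligned}
Y_{t1} &= \mathbf{X}_{t1}'\boldsymbol{\beta}_1 + \varepsilon_{t1} \\
Y_{t2} &= \mathbf{X}_{t2}'\boldsymbol{\beta}_2 + \varepsilon_{t2} \\
&\;\;\vdots \\
Y_{tp} &= \mathbf{X}_{tp}'\boldsymbol{\beta}_p + \varepsilon_{tp}, \qquad t = 1, \ldots, n.
\end{aligned}$$

Now let $\mathbf{Y}_t$ be a $p \times 1$ vector,
$\mathbf{Y}_t = (Y_{t1}, Y_{t2}, \ldots, Y_{tp})'$, let
$\boldsymbol{\varepsilon}_t' = (\varepsilon_{t1}, \varepsilon_{t2}, \ldots, \varepsilon_{tp})$,
let $\boldsymbol{\beta}_o = (\boldsymbol{\beta}_1', \boldsymbol{\beta}_2', \ldots, \boldsymbol{\beta}_p')'$,
and let

$$\mathbf{X}_t = \begin{bmatrix}
\mathbf{X}_{t1} & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{X}_{t2} & \cdots & \mathbf{0} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{X}_{tp}
\end{bmatrix}.$$

Now $\mathbf{X}_t$ is a $k \times p$ matrix, where
$k = \sum_{i=1}^{p} k_i$ and $\mathbf{X}_{ti}$ is a $k_i \times 1$ vector.
The system of equations can be written as

$$\begin{bmatrix} Y_{t1} \\ Y_{t2} \\ \vdots \\ Y_{tp} \end{bmatrix}
= \begin{bmatrix}
\mathbf{X}_{t1}' & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{X}_{t2}' & \cdots & \mathbf{0} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{X}_{tp}'
\end{bmatrix}
\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \\ \vdots \\ \boldsymbol{\beta}_p \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{t1} \\ \varepsilon_{t2} \\ \vdots \\ \varepsilon_{tp} \end{bmatrix},$$

or

$$\mathbf{Y}_t = \mathbf{X}_t'\boldsymbol{\beta}_o + \boldsymbol{\varepsilon}_t.$$

Letting $\mathbf{Y} = (\mathbf{Y}_1', \mathbf{Y}_2', \ldots, \mathbf{Y}_n')'$,
$\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n)'$, and
$\boldsymbol{\varepsilon} = (\boldsymbol{\varepsilon}_1', \boldsymbol{\varepsilon}_2', \ldots, \boldsymbol{\varepsilon}_n')'$,
we can write this system as

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}_o + \boldsymbol{\varepsilon}.$$

Now $\mathbf{Y}$ is $pn \times 1$, $\boldsymbol{\varepsilon}$ is
$pn \times 1$, and $\mathbf{X}$ is $pn \times k$. This allows us to
consider simultaneous systems of equations in the present framework.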
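
The stacking just described is mostly bookkeeping, and a short sketch may
help fix the dimensions. The code below assumes NumPy and a hypothetical
system with $p = 2$ equations, $k_1 = 2$, $k_2 = 3$, and $n = 4$
observations; the helper `block_diag_columns` is ours. It builds each
$k \times p$ matrix $\mathbf{X}_t$, confirms that the stacked form
reproduces the equation-by-equation values, and assembles the
$pn \times k$ matrix $\mathbf{X}$.

```python
import numpy as np

def block_diag_columns(vectors):
    """Place k_i-vectors on the block diagonal of a (sum k_i) x p matrix."""
    k = sum(len(v) for v in vectors)
    out = np.zeros((k, len(vectors)))
    row = 0
    for i, v in enumerate(vectors):
        out[row:row + len(v), i] = v
        row += len(v)
    return out

rng = np.random.default_rng(6)
n, ks = 4, [2, 3]                    # hypothetical: p = 2, k_1 = 2, k_2 = 3
p, k = len(ks), sum(ks)
betas = [rng.normal(size=ki) for ki in ks]
beta_o = np.concatenate(betas)       # (beta_1', beta_2')'

X_blocks, Y_rows = [], []
for t in range(n):
    x_t = [rng.normal(size=ki) for ki in ks]          # X_t1, X_t2
    eps_t = rng.normal(size=p)
    # Equation by equation: Y_ti = X_ti' beta_i + eps_ti
    Y_t = np.array([x @ b for x, b in zip(x_t, betas)]) + eps_t
    X_t = block_diag_columns(x_t)                     # k x p
    # Stacked form gives the same Y_t: X_t' beta_o + eps_t
    assert np.allclose(Y_t, X_t.T @ beta_o + eps_t)
    X_blocks.append(X_t.T)                            # p x k
    Y_rows.append(Y_t)

X = np.vstack(X_blocks)              # pn x k
Y = np.concatenate(Y_rows)           # pn elements
```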

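Alternatively, suppose that we have observations on an individual $t$ in
each of $p$ time periods,

$$\begin{aligned}
Y_{t1} &= \mathbf{X}_{t1}'\boldsymbol{\beta}_o + \varepsilon_{t1}, \\
Y_{t2} &= \mathbf{X}_{t2}'\boldsymbol{\beta}_o + \varepsilon_{t2}, \\
&\;\;\vdots \\
Y_{tp} &= \mathbf{X}_{tp}'\boldsymbol{\beta}_o + \varepsilon_{tp}, \qquad t = 1, \ldots, n.
\end{aligned}$$

Define $\mathbf{Y}_t$ and $\boldsymbol{\varepsilon}_t$ as above, and let

$$\mathbf{X}_t = [\mathbf{X}_{t1}, \mathbf{X}_{t2}, \ldots, \mathbf{X}_{tp}]$$

be a $k \times p$ matrix. The observations can now be written as

$$\mathbf{Y}_t = \mathbf{X}_t'\boldsymbol{\beta}_o + \boldsymbol{\varepsilon}_t,$$

or equivalently as

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}_o + \boldsymbol{\varepsilon},$$

with $\mathbf{Y}$, $\mathbf{X}$, and $\boldsymbol{\varepsilon}$ as defined
above. This allows us to consider panel data in the present framework.
Further, by adopting appropriate definitions, the case of simultaneous
systems of equations for panel data can also be considered.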

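Recall that the GLS estimator was obtained by considering a linear
transformation of a linear stochastic relationship, i.e.,

$$\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}_o + \boldsymbol{\varepsilon}^*,$$

where $\mathbf{Y}^* = \mathbf{C}^{-1}\mathbf{Y}$,
$\mathbf{X}^* = \mathbf{C}^{-1}\mathbf{X}$, and
$\boldsymbol{\varepsilon}^* = \mathbf{C}^{-1}\boldsymbol{\varepsilon}$ for
some nonsingular matrix $\mathbf{C}$. It follows that any such linear
transformation can be considered within the present framework.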

The reason for restricting our attention to linear models and IV estima- 
tors is to provide clear motivation for the concepts and techniques intro- 
duced while also maintaining a relatively simple focus for the discussion. 
Nevertheless, the tools presented have a much wider applicability and are 
directly relevant to many other models and estimation techniques.


CHAPTER 2


Consistency 


In this chapter we introduce the concepts needed to analyze the behavior
of $\hat{\boldsymbol{\beta}}_n$ and $\tilde{\boldsymbol{\beta}}_n$ as
$n \to \infty$.


2.1 Limits 


The most fundamental concept is that of a limit. 


Definition 2.1 Let $\{b_n\}$ be a sequence of real numbers. If there exists
a real number $b$ and if for every real $\delta > 0$ there exists an integer
$N(\delta)$ such that for all $n > N(\delta)$, $|b_n - b| < \delta$, then
$b$ is the limit of the sequence $\{b_n\}$.


In this definition the constant $\delta$ can take on any real value, but it
is the very small values of $\delta$ that provide the definition with its
impact. By choosing a very small $\delta$, we ensure that $b_n$ gets
arbitrarily close to its limit $b$ for all $n$ that are sufficiently large.
When a limit exists, we say that the sequence $\{b_n\}$ converges to $b$ as
$n$ tends to infinity, written $b_n \to b$ as $n \to \infty$. We also write
$b = \lim_{n \to \infty} b_n$. When no ambiguity is possible, we simply
write $b_n \to b$ or $b = \lim b_n$. If for any $a \in \mathbb{R}$, there
exists an integer $N(a)$ such that $b_n > a$ for all $n > N(a)$, we write
$b_n \to \infty$, and we write $b_n \to -\infty$ if $-b_n \to \infty$.


Example 2.2 (i) Let $b_n = 1 - 1/n$. Then $b_n \to 1$. (ii) Let
$b_n = (1 + a/n)^n$. Then $b_n \to e^a$. (iii) Let $b_n = n^2$. Then
$b_n \to \infty$. (iv) Let $b_n = (-1)^n$. Then no limit exists.
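
The four sequences of Example 2.2 are easy to examine numerically. The
sketch below uses only the Python standard library; the value $a = 2$ and
the tolerance $\delta = 10^{-3}$ are hypothetical choices, and the helper
`N_for_case_i` is ours. For case (i) it exhibits an explicit $N(\delta)$
and checks the defining inequality over a stretch of the tail; a finite
check of this kind illustrates the definition but cannot establish
convergence.

```python
import math

a = 2.0                                   # illustrative constant for case (ii)
seqs = {
    "(i)":   lambda n: 1 - 1 / n,         # limit 1
    "(ii)":  lambda n: (1 + a / n) ** n,  # limit e^a
    "(iii)": lambda n: n ** 2,            # diverges to infinity
    "(iv)":  lambda n: (-1) ** n,         # oscillates, no limit
}

def N_for_case_i(delta):
    """An N(delta) for b_n = 1 - 1/n: |b_n - 1| = 1/n < delta once n > 1/delta."""
    return math.floor(1 / delta) + 1

delta = 1e-3
N = N_for_case_i(delta)
# Check |b_n - 1| < delta on a finite stretch of the tail n >= N
tail_ok = all(abs(seqs["(i)"](n) - 1) < delta for n in range(N, N + 1000))

# Case (ii): the gap to e^a is already small at n = 10^6
gap_ii = abs(seqs["(ii)"](10**6) - math.exp(a))

# Case (iv): consecutive terms keep alternating between 1 and -1
last_two_iv = (seqs["(iv)"](10**6), seqs["(iv)"](10**6 + 1))
```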


The concept of a limit extends directly to sequences of real vectors. Let
$\mathbf{b}_n$ be a $k \times 1$ vector with real elements $b_{ni}$,
$i = 1, \ldots, k$. If $b_{ni} \to b_i$, $i = 1, \ldots, k$, then
$\mathbf{b}_n \to \mathbf{b}$, where $\mathbf{b}$ has elements $b_i$,
$i = 1, \ldots, k$. An analogous extension applies to matrices.

Often we wish to consider the limit of a continuous function of a sequence. 
For this, either of the following equivalent definitions of continuity suffices. 


Definition 2.3 Given $g : \mathbb{R}^k \to \mathbb{R}^l$ ($k, l \in \mathbb{N}$) and $b \in \mathbb{R}^k$, (i) the function
$g$ is continuous at $b$ if for any sequence $\{b_n\}$ such that $b_n \to b$,
$g(b_n) \to g(b)$; or equivalently (ii) the function $g$ is continuous at $b$ if for every
$\varepsilon > 0$ there exists $\delta(\varepsilon) > 0$ such that if $a \in \mathbb{R}^k$ and $|a_i - b_i| < \delta(\varepsilon)$,
$i = 1, \ldots, k$, then $|g_j(a) - g_j(b)| < \varepsilon$, $j = 1, \ldots, l$. Further, if $B \subset \mathbb{R}^k$,
then $g$ is continuous on $B$ if it is continuous at every point of $B$.


Example 2.4 (i) From this it follows that if $a_n \to a$ and $b_n \to b$, then
$a_n + b_n \to a + b$ and $a_n b_n' \to a b'$. (ii) The matrix inverse function is
continuous at every point that represents a nonsingular matrix, so that if
$X'X/n \to M$, a finite nonsingular matrix, then $(X'X/n)^{-1} \to M^{-1}$.


Often it is useful to have a measure of the order of magnitude of a
particular sequence without particularly worrying about its convergence.
The following definition compares the behavior of a sequence $\{b_n\}$ with the
behavior of a power of $n$, say $n^\lambda$, where $\lambda$ is chosen so that $\{b_n\}$ and $\{n^\lambda\}$
behave similarly.


Definition 2.5 (i) The sequence $\{b_n\}$ is at most of order $n^\lambda$, denoted
$b_n = O(n^\lambda)$, if for some finite real number $\Delta > 0$, there exists a finite
integer $N$ such that for all $n > N$, $|n^{-\lambda} b_n| < \Delta$. (ii) The sequence $\{b_n\}$ is
of order smaller than $n^\lambda$, denoted $b_n = o(n^\lambda)$, if for every real number $\delta > 0$
there exists a finite integer $N(\delta)$ such that for all $n > N(\delta)$, $|n^{-\lambda} b_n| < \delta$,
i.e., $n^{-\lambda} b_n \to 0$.


In this definition we adopt a convention that we utilize repeatedly in the
material to follow; specifically, we let $\Delta$ represent a real positive constant
that we may take to be as large as necessary, and we let $\delta$ (and similarly
$\varepsilon$) represent a real positive constant that we may take to be as small as
necessary. In any two different places $\Delta$ (or $\delta$) need not represent the same
value, although there is no loss of generality in supposing that it does.
(Why?)

As we have defined these notions, $b_n = O(n^\lambda)$ if $\{n^{-\lambda} b_n\}$ is eventually
bounded, whereas $b_n = o(n^\lambda)$ if $n^{-\lambda} b_n \to 0$. Obviously, if $b_n = o(n^\lambda)$, then
$b_n = O(n^\lambda)$. Further, if $b_n = O(n^\lambda)$, then for every $\delta > 0$, $b_n = o(n^{\lambda+\delta})$.
When $b_n = O(n^0)$, it is simply (eventually) bounded and may or may not
have a limit. We often write $O(1)$ in place of $O(n^0)$. Similarly, $b_n = o(1)$
means $b_n \to 0$.


Example 2.6 (i) Let $b_n = 4 + 2n + 6n^2$. Then $b_n = O(n^2)$ and
$b_n = o(n^{2+\delta})$ for every $\delta > 0$. (ii) Let $b_n = (-1)^n$. Then $b_n = O(1)$ and
$b_n = o(n^\delta)$ for every $\delta > 0$. (iii) Let $b_n = \exp(-n)$. Then $b_n = o(n^{-\delta})$ for
every $\delta > 0$ and $b_n = O(n^{-\delta})$ for every $\delta > 0$. (iv) Let $b_n = \exp(n)$. Then
$b_n \ne O(n^\kappa)$ for any $\kappa \in \mathbb{R}$.
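
As a worked check of Example 2.6 (i) against Definition 2.5, the following lines verify both claims directly. The particular bound $\Delta = 13$ and index $N = 1$ are just one convenient choice made here, not values taken from the text.

```latex
\begin{align*}
n^{-2} b_n &= \frac{4}{n^2} + \frac{2}{n} + 6 \le 4 + 2 + 6 = 12 < 13
  \quad \text{for all } n \ge 1, \\
n^{-(2+\delta)} b_n &= n^{-\delta} \left( \frac{4}{n^2} + \frac{2}{n} + 6 \right)
  \le 12\, n^{-\delta} \to 0 \quad \text{for every } \delta > 0 .
\end{align*}
```

The first line gives $b_n = O(n^2)$ with $\Delta = 13$; the second gives $b_n = o(n^{2+\delta})$.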


If each element of a vector or matrix is $O(n^\lambda)$ or $o(n^\lambda)$, then that vector
or matrix is $O(n^\lambda)$ or $o(n^\lambda)$.

Some elementary facts about the orders of magnitude of sums and prod- 
ucts of sequences are given by the next result. 


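Proposition 2.7 Let $a_n$ and $b_n$ be scalars. (i) If $a_n = O(n^\lambda)$ and
$b_n = O(n^\mu)$, then $a_n b_n = O(n^{\lambda+\mu})$ and $a_n + b_n = O(n^\kappa)$, where $\kappa = \max[\lambda, \mu]$.
(ii) If $a_n = o(n^\lambda)$ and $b_n = o(n^\mu)$, then $a_n b_n = o(n^{\lambda+\mu})$ and
$a_n + b_n = o(n^\kappa)$. (iii) If $a_n = O(n^\lambda)$ and $b_n = o(n^\mu)$, then $a_n b_n = o(n^{\lambda+\mu})$ and
$a_n + b_n = O(n^\kappa)$.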


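Proof. (i) Since $a_n = O(n^\lambda)$ and $b_n = O(n^\mu)$, there exist a finite $\Delta > 0$ and
$N \in \mathbb{N}$ such that, for all $n > N$, $|n^{-\lambda} a_n| < \Delta$ and $|n^{-\mu} b_n| < \Delta$. Consider
$a_n b_n$. Now $|n^{-\lambda-\mu} a_n b_n| = |n^{-\lambda} a_n \, n^{-\mu} b_n| = |n^{-\lambda} a_n| \cdot |n^{-\mu} b_n| < \Delta^2$ for
all $n > N$. Hence $a_n b_n = O(n^{\lambda+\mu})$. Consider $a_n + b_n$. Now
$|n^{-\kappa}(a_n + b_n)| = |n^{-\kappa} a_n + n^{-\kappa} b_n| \le |n^{-\kappa} a_n| + |n^{-\kappa} b_n|$ by the triangle inequality. Since $\kappa \ge \lambda$
and $\kappa \ge \mu$, $|n^{-\kappa}(a_n + b_n)| \le |n^{-\kappa} a_n| + |n^{-\kappa} b_n| \le |n^{-\lambda} a_n| + |n^{-\mu} b_n| < 2\Delta$
for all $n > N$. Hence $a_n + b_n = O(n^\kappa)$, $\kappa = \max[\lambda, \mu]$.

(ii) The proof is identical to that of (i), replacing $\Delta$ with every $\delta > 0$
and $N$ with $N(\delta)$.

(iii) Since $a_n = O(n^\lambda)$ there exist a finite $\Delta > 0$ and $N' \in \mathbb{N}$ such that
for all $n > N'$, $|n^{-\lambda} a_n| < \Delta$. Given $\delta > 0$, let $\delta'' = \delta/\Delta$. Then since
$b_n = o(n^\mu)$ there exists $N''(\delta'')$ such that $|n^{-\mu} b_n| < \delta''$ for $n > N''(\delta'')$.
Now $|n^{-\lambda-\mu} a_n b_n| = |n^{-\lambda} a_n \, n^{-\mu} b_n| = |n^{-\lambda} a_n| \cdot |n^{-\mu} b_n| < \Delta \delta'' = \delta$ for
$n > N \equiv \max(N', N''(\delta''))$. Hence $a_n b_n = o(n^{\lambda+\mu})$. Since $b_n = o(n^\mu)$, it is
also $O(n^\mu)$. That $a_n + b_n = O(n^\kappa)$ follows from (i). ■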


A particularly important special case is illustrated by the following ex- 
ercise. 

Exercise 2.8 Let $A_n$ be a $k \times k$ matrix and let $b_n$ be a $k \times 1$ vector. If
$A_n = o(1)$ and $b_n = O(1)$, verify that $A_n b_n = o(1)$.


For the most part, econometrics is concerned not simply with sequences 
of real numbers, but rather with sequences of real-valued random scalars or 
vectors. Very often these are either averages, for example, $\bar Z_n = n^{-1} \sum_{t=1}^{n} Z_t$,
or functions of averages, such as $\bar Z_n^2$, where $\{Z_t\}$ is, for example, a sequence
of random scalars. Since the $Z_t$'s are random variables, we have to allow for
a possibility that would not otherwise occur, that is, that different realizations
of the sequence $\{Z_t\}$ can lead to different limits for $\bar Z_n$. Convergence
to a particular value must now be considered as a random event and our in- 
terest centers on cases in which nonconvergence occurs only rarely in some 
appropriately defined sense. 


2.2 Almost Sure Convergence 


The stochastic convergence concept most closely related to the limit notions 
previously discussed is that of almost sure convergence. Sequences that 
converge almost surely can be manipulated in almost exactly the same 
ways as nonrandom sequences. 

Random variables are best viewed as functions from an underlying space
$\Omega$ to the real line. Thus, when discussing a real-valued random variable $b_n$,
we are in fact talking about a mapping $b_n : \Omega \to \mathbb{R}$. We let $\omega$ be a typical
element of $\Omega$ and call the real number $b_n(\omega)$ a realization of the random
variable. Subsets of $\Omega$, for example $\{\omega \in \Omega : b_n(\omega) < a\}$, are events and
we will assign a probability to these, e.g., $P\{\omega \in \Omega : b_n(\omega) < a\}$. We write
$P[b_n < a]$ as a shorthand notation. There are additional details that we
will consider more carefully in subsequent chapters, but this understanding
will suffice for now.

Interest will often center on averages such as 


$$b_n(\cdot) = n^{-1} \sum_{t=1}^{n} Z_t(\cdot).$$


We write the parentheses with dummy argument $(\cdot)$ to emphasize that $b_n$
and $Z_t$ are functions.


Definition 2.9 Let $\{b_n(\cdot)\}$ be a sequence of real-valued random variables.
We say that $b_n(\cdot)$ converges almost surely to $b$, written $b_n(\cdot) \xrightarrow{a.s.} b$, if there
exists a real number $b$ such that $P\{\omega : b_n(\omega) \to b\} = 1$.

The probability measure $P$ determines the joint distribution of the entire
sequence $\{Z_t\}$. A sequence $b_n$ converges almost surely if the probability of
obtaining a realization of the sequence $\{Z_t\}$ for which convergence to $b$
occurs is unity. Equivalently, the probability of observing a realization of
$\{Z_t\}$ for which convergence to $b$ does not occur is zero. Failure to converge
is possible but will almost never happen under this definition. Obviously,
then, nonstochastic convergence implies almost sure convergence.

Because the set of $\omega$'s for which $b_n(\omega) \to b$ has probability one, $b_n$ is
sometimes said to converge to $b$ with probability 1 (w.p.1). Other common
terminology is that $b_n$ converges almost everywhere (a.e.) in $\Omega$ or that $b_n$
is strongly consistent for $b$. When no ambiguity is possible, we drop the
notation $(\cdot)$ and simply write $b_n \xrightarrow{a.s.} b$ instead of $b_n(\cdot) \xrightarrow{a.s.} b$.


Example 2.10 Let $\bar Z_n \equiv n^{-1} \sum_{t=1}^{n} Z_t$, where $\{Z_t\}$ is a sequence of independent
identically distributed (i.i.d.) random variables with $\mu \equiv E(Z_t) < \infty$.
Then $\bar Z_n \xrightarrow{a.s.} \mu$, by the Kolmogorov strong law of large numbers (Theorem 3.1).
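
A single simulated realization can make the statement of Example 2.10 concrete. The sketch below is only an illustration under assumptions chosen here (exponential draws with mean 2, a fixed seed standing in for one $\omega$, and the hypothetical helper `running_means`); a simulation shows one path up to a finite $n$ and cannot establish almost sure convergence.

```python
import random

def running_means(draw, n, seed=0):
    """Return the path of sample means Z_bar_1, ..., Z_bar_n for one
    realization of an i.i.d. sequence generated by draw(rng)."""
    rng = random.Random(seed)   # fixing the seed fixes the realization
    total, path = 0.0, []
    for t in range(1, n + 1):
        total += draw(rng)
        path.append(total / t)
    return path

mu = 2.0  # population mean of the exponential draws (rate 1/mu)
path = running_means(lambda rng: rng.expovariate(1 / mu), 200_000)

# Distance of the sample mean from mu along this one path.
for n in (10, 1_000, 200_000):
    print(n, abs(path[n - 1] - mu))
```

Changing the seed changes the realization; the strong law says that, apart from a set of realizations with probability zero, every such path settles down at $\mu$.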


The almost sure convergence of the sample mean illustrated by this example
occurs under a wide variety of conditions on the sequence $\{Z_t\}$. A
discussion of these conditions is the subject of the next chapter.

As with nonstochastic limits, the almost sure convergence concept ex- 
tends immediately to vectors and matrices of finite dimension. Almost sure 
convergence element by element suffices for almost sure convergence of vec- 
tors and matrices. 

The behavior of continuous functions of almost surely convergent se- 
quences is analogous to the nonstochastic case. 


Proposition 2.11 Given $g : \mathbb{R}^k \to \mathbb{R}^l$ ($k, l \in \mathbb{N}$) and any sequence of
random $k \times 1$ vectors $\{b_n\}$ such that $b_n \xrightarrow{a.s.} b$, where $b$ is $k \times 1$, if $g$ is
continuous at $b$, then $g(b_n) \xrightarrow{a.s.} g(b)$.


Proof. Since $b_n(\omega) \to b$ implies $g(b_n(\omega)) \to g(b)$,
$$\{\omega : b_n(\omega) \to b\} \subset \{\omega : g(b_n(\omega)) \to g(b)\}.$$
Hence
$$1 = P\{\omega : b_n(\omega) \to b\} \le P\{\omega : g(b_n(\omega)) \to g(b)\} \le 1,$$
so that $g(b_n) \xrightarrow{a.s.} g(b)$. ■

This result is one of the most important in this book, because consistency
results for many of our estimators follow by simply applying Proposition
2.11.


Theorem 2.12 Suppose
(i) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $X'\varepsilon/n \xrightarrow{a.s.} 0$;
(iii) $X'X/n \xrightarrow{a.s.} M$, finite and positive definite.
Then $\hat\beta_n$ exists for all $n$ sufficiently large a.s., and $\hat\beta_n \xrightarrow{a.s.} \beta_o$.


Proof. Since $X'X/n \xrightarrow{a.s.} M$, it follows from Proposition 2.11 that
$\det(X'X/n) \xrightarrow{a.s.} \det(M)$.
Because $M$ is positive definite by (iii), $\det(M) > 0$. It follows that for
all $n$ sufficiently large $\det(X'X/n) > 0$ a.s., so $(X'X/n)^{-1}$ exists for all $n$
sufficiently large a.s. Hence $\hat\beta_n = \beta_o + (X'X/n)^{-1} X'\varepsilon/n$ exists for all $n$
sufficiently large a.s.

Now $\hat\beta_n = \beta_o + (X'X/n)^{-1} X'\varepsilon/n$ by (i). It follows from Proposition
2.11 that $\hat\beta_n \xrightarrow{a.s.} \beta_o + M^{-1} \cdot 0 = \beta_o$, given (ii) and (iii). ■

In the proof, we refer to events that occur a.s. Any event that occurs 
with probability one is said to occur almost surely (a.s.) (e.g., convergence 
to a limit or existence of the inverse). 

Theorem 2.12 is a fundamental consistency result for least squares estimation
in many commonly encountered situations. Whether this result
applies in a given situation depends on the nature of the data. For example,
if our observations are randomly drawn from a population, as in a
pure cross section, they may be taken to be i.i.d. The conditions of Theorem
2.12 hold for i.i.d. observations provided $E(X_t X_t') = M$, finite and
positive definite, and $E(X_t \varepsilon_t) = 0$, since Kolmogorov's strong law of large
numbers (Example 2.10) ensures that $X'X/n = n^{-1} \sum_{t=1}^{n} X_t X_t' \xrightarrow{a.s.} M$
and $X'\varepsilon/n = n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t \xrightarrow{a.s.} 0$. If the observations are dependent (as
in a time series), different laws of large numbers must be applied to guarantee
that the appropriate conditions hold. These are given in the next
chapter.
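
The i.i.d. case just described can be sketched in a small simulation. Everything specific below is an assumption made for illustration: the two-regressor design $X_t = (1, x_t)'$ with $x_t \sim N(0,1)$, uniform errors independent of $x_t$, the coefficient values, and the hypothetical function name `ols_two_regressors`. Under these choices $E(X_t X_t')$ is the identity matrix and $E(X_t \varepsilon_t) = 0$, so the conditions of Theorem 2.12 hold.

```python
import random

def ols_two_regressors(n, beta=(1.0, -0.5), seed=0):
    """Simulate Y_t = beta[0] + beta[1]*x_t + e_t with i.i.d. draws and
    return the OLS estimate together with det(X'X/n)."""
    rng = random.Random(seed)
    sxx = [[0.0, 0.0], [0.0, 0.0]]   # accumulates X'X
    sxy = [0.0, 0.0]                 # accumulates X'Y
    for _ in range(n):
        x = (1.0, rng.gauss(0.0, 1.0))       # X_t = (1, x_t)'
        e = rng.uniform(-1.0, 1.0)           # independent of x_t, mean zero
        y = beta[0] * x[0] + beta[1] * x[1] + e
        for i in range(2):
            sxy[i] += x[i] * y
            for j in range(2):
                sxx[i][j] += x[i] * x[j]
    m = [[v / n for v in row] for row in sxx]    # X'X/n
    c = [v / n for v in sxy]                     # X'Y/n
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]  # nonzero iff the inverse exists
    # Apply the explicit 2 x 2 inverse of X'X/n to X'Y/n.
    b0 = (m[1][1] * c[0] - m[0][1] * c[1]) / det
    b1 = (m[0][0] * c[1] - m[1][0] * c[0]) / det
    return (b0, b1), det

for n in (50, 5_000, 200_000):
    estimate, det = ols_two_regressors(n)
    print(n, estimate, det)
```

As $n$ grows, the determinant of $X'X/n$ stays away from zero (here it approaches $\det M = 1$), so the estimator exists, and the estimate moves toward the coefficients used to generate the data.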

A result for the IV estimator can be proven analogously. 


Exercise 2.13 Prove the following result. Suppose
(i) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $Z'\varepsilon/n \xrightarrow{a.s.} 0$;
(iii) (a) $Z'X/n \xrightarrow{a.s.} Q$, finite with full column rank;
(b) $\hat P_n \xrightarrow{a.s.} P$, finite and positive definite.
Then $\tilde\beta_n$ exists for all $n$ sufficiently large a.s., and $\tilde\beta_n \xrightarrow{a.s.} \beta_o$.


This consistency result for the IV estimator precisely specifies the conditions
that must be satisfied for a sequence of random vectors $\{Z_t\}$ to act
as a set of instrumental variables. They must be unrelated to the errors, as
specified by assumption (ii), and they must be closely enough related to the
explanatory variables that $Z'X/n$ converges to a matrix with full column
rank, as required by assumption (iii.a). Note that a necessary condition for
this is that the order condition for identification holds (see Fisher, 1966,
Chapter 2); that is, that $l \ge k$. (Recall that $Z$ is $pn \times l$ and $X$ is $pn \times k$.)
For now, we simply treat the instrumental variables as given. In Chapter 4 
we see how the instrumental variables may be chosen optimally. 

A potentially restrictive aspect of the consistency results just given for
the least squares and IV estimators is that the matrices $X'X/n$, $Z'X/n$,
and $\hat P_n$ are each required to converge to a fixed limiting value. When the
observations are not identically distributed (as in a stratified cross section,
a panel, or certain time-series cases), these matrices need not converge, and
the results of Theorem 2.12 and Exercise 2.13 do not necessarily apply.

Nevertheless, it is possible to obtain more general versions of these results
that do not require the convergence of $X'X/n$, $Z'X/n$, or $\hat P_n$ by generalizing
Proposition 2.11. To do this we make use of the notion of uniform
continuity.


Definition 2.14 Given $g : \mathbb{R}^k \to \mathbb{R}^l$ ($k, l \in \mathbb{N}$), we say that $g$ is uniformly
continuous on a set $B \subset \mathbb{R}^k$ if for each $\varepsilon > 0$ there is a $\delta(\varepsilon) > 0$ such that if
$a$ and $b$ belong to $B$ and $|a_i - b_i| < \delta(\varepsilon)$, $i = 1, \ldots, k$, then
$|g_j(a) - g_j(b)| < \varepsilon$, $j = 1, \ldots, l$.


Note that uniform continuity implies continuity on $B$ but that continuity
on $B$ does not imply uniform continuity. The essential aspect of uniform
continuity that distinguishes it from continuity is that $\delta$ depends only on $\varepsilon$
and not on $b$. However, when $B$ is compact, continuity does imply uniform
continuity, as formally stated in the next result.


Theorem 2.15 (Uniform continuity theorem) Suppose $g : \mathbb{R}^k \to \mathbb{R}^l$
is a continuous function on $C \subset \mathbb{R}^k$. If $C$ is compact, then $g$ is uniformly
continuous on $C$.


Proof. See Bartle (1976, p. 160). ■

Now we extend Proposition 2.11 to cover situations where a random
sequence $\{b_n\}$ does not necessarily converge to a fixed point but instead
"follows" a nonrandom sequence $\{c_n\}$, in the sense that $b_n - c_n \xrightarrow{a.s.} 0$,
where the sequence of real numbers $\{c_n\}$ does not necessarily converge.


Proposition 2.16 Let $g : \mathbb{R}^k \to \mathbb{R}^l$ be continuous on a compact set
$C \subset \mathbb{R}^k$. Suppose that $\{b_n\}$ is a sequence of random $k \times 1$ vectors and $\{c_n\}$ is a
sequence of $k \times 1$ vectors such that $b_n(\cdot) - c_n \xrightarrow{a.s.} 0$ and there exists $\eta > 0$
such that for all $n$ sufficiently large $\{c : |c_i - c_{ni}| < \eta, i = 1, \ldots, k\} \subset C$,
i.e., for all $n$ sufficiently large, $c_n$ is interior to $C$ uniformly in $n$. Then
$g(b_n(\cdot)) - g(c_n) \xrightarrow{a.s.} 0$.


Proof. Let $g_j$ be the $j$th element of $g$. Since $C$ is compact, $g_j$ is uniformly
continuous on $C$ by Theorem 2.15. Let $F = \{\omega : b_n(\omega) - c_n \to 0\}$; then
$P(F) = 1$ since $b_n - c_n \xrightarrow{a.s.} 0$. Choose $\omega \in F$. Since $c_n$ is interior to $C$
for all $n$ sufficiently large uniformly in $n$ and $b_n(\omega) - c_n \to 0$, $b_n(\omega)$ is
also interior to $C$ for all $n$ sufficiently large. By uniform continuity, for any
$\varepsilon > 0$ there exists $\delta(\varepsilon) > 0$ such that if $|b_{ni}(\omega) - c_{ni}| < \delta(\varepsilon)$, $i = 1, \ldots, k$,
then $|g_j(b_n(\omega)) - g_j(c_n)| < \varepsilon$. Hence $g(b_n(\omega)) - g(c_n) \to 0$. Since this is
true for any $\omega \in F$ and $P(F) = 1$, then $g(b_n) - g(c_n) \xrightarrow{a.s.} 0$. ■


To state the results for the OLS and IV estimators below concisely, we 
define the following concepts, as given by White (1982, pp. 484-485). 


Definition 2.17 A sequence of $k \times k$ matrices $\{A_n\}$ is said to be uniformly
nonsingular if for some $\delta > 0$ and all $n$ sufficiently large $|\det(A_n)| > \delta$.
If $\{A_n\}$ is a sequence of positive semidefinite matrices, then $\{A_n\}$ is
uniformly positive definite if $\{A_n\}$ is uniformly nonsingular. If $\{A_n\}$ is
a sequence of $l \times k$ matrices, then $\{A_n\}$ has uniformly full column rank
if there exists a sequence of $k \times k$ submatrices $\{A_n^*\}$ which is uniformly
nonsingular.


If a sequence of matrices is uniformly nonsingular, the elements of the 
sequence are prevented from getting “too close” to singularity. Similarly, if 
a sequence of matrices has uniformly full column rank, the elements of the 
sequence are prevented from getting “too close” to a matrix with less than 
full column rank. 

Next we state the desired extensions of Theorem 2.12 and Exercise 2.13. 


Theorem 2.18 Suppose
(i) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $X'\varepsilon/n \xrightarrow{a.s.} 0$;
(iii) $X'X/n - M_n \xrightarrow{a.s.} 0$, where $M_n = O(1)$ and is uniformly positive
definite.
Then $\hat\beta_n$ exists for all $n$ sufficiently large a.s., and $\hat\beta_n \xrightarrow{a.s.} \beta_o$.


Proof. Because $M_n = O(1)$, it is bounded for all $n$ sufficiently large,
and it follows from Proposition 2.16 that $\det(X'X/n) - \det(M_n) \xrightarrow{a.s.} 0$.
Since $\det(M_n) > \delta > 0$ for all $n$ sufficiently large by Definition 2.17,
it follows that $\det(X'X/n) > \delta/2 > 0$ for all $n$ sufficiently large a.s.,
so that $(X'X/n)^{-1}$ exists for all $n$ sufficiently large a.s. Hence
$\hat\beta_n = (X'X/n)^{-1} X'Y/n$ exists for all $n$ sufficiently large a.s.

Now $\hat\beta_n = \beta_o + (X'X/n)^{-1} X'\varepsilon/n$ by (i). It follows from Proposition
2.16 that $\hat\beta_n - (\beta_o + M_n^{-1} \cdot 0) \xrightarrow{a.s.} 0$ or $\hat\beta_n \xrightarrow{a.s.} \beta_o$, given (ii) and (iii). ■

Compared with Theorem 2.12, the present result relaxes the requirement
that $X'X/n \xrightarrow{a.s.} M$ and instead requires that $X'X/n - M_n \xrightarrow{a.s.} 0$, allowing
for the possibility that $X'X/n$ may not converge to a fixed limit. Note that
the requirement $\det(M_n) > \delta > 0$ ensures the uniform continuity of the
matrix inverse function.

The proof of the IV result requires a demonstration that $\{Q_n' P_n Q_n\}$ is
uniformly positive definite under appropriate conditions. These conditions
are provided by the following result.


Lemma 2.19 If $\{A_n\}$ is an $O(1)$ sequence of $l \times k$ matrices with uniformly
full column rank and $\{B_n\}$ is an $O(1)$ sequence of uniformly positive definite
$l \times l$ matrices, then $\{A_n' B_n A_n\}$ and $\{A_n' B_n^{-1} A_n\}$ are $O(1)$ sequences of
uniformly positive definite $k \times k$ matrices.


Proof. See White (1982, Lemma A.3). ■


Exercise 2.20 Prove the following result. Suppose
(i) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $Z'\varepsilon/n \xrightarrow{a.s.} 0$;
(iii) (a) $Z'X/n - Q_n \xrightarrow{a.s.} 0$, where $Q_n = O(1)$ and has uniformly full
column rank;
(b) $\hat P_n - P_n \xrightarrow{a.s.} 0$, where $P_n = O(1)$ and is symmetric and uniformly
positive definite.
Then $\tilde\beta_n$ exists for all $n$ sufficiently large a.s., and $\tilde\beta_n \xrightarrow{a.s.} \beta_o$.


The notion of orders of magnitude extends to almost surely convergent 
sequences in a straightforward way. 


Definition 2.21 (i) The random sequence $\{b_n\}$ is at most of order $n^\lambda$
almost surely, denoted $b_n = O_{a.s.}(n^\lambda)$, if there exist $\Delta < \infty$ and $N < \infty$
such that $P[|n^{-\lambda} b_n| < \Delta \text{ for all } n > N] = 1$. (ii) The sequence $\{b_n\}$ is of
order smaller than $n^\lambda$ almost surely, denoted $b_n = o_{a.s.}(n^\lambda)$, if $n^{-\lambda} b_n \xrightarrow{a.s.} 0$.

A sufficient condition that $b_n = O_{a.s.}(n^\lambda)$ is that $n^{-\lambda} b_n - a_n \xrightarrow{a.s.} 0$,
where $a_n = O(1)$. The algebra of $O_{a.s.}$ and $o_{a.s.}$ is analogous to that for $O$
and $o$.


Exercise 2.22 Prove the following. Let $a_n$ and $b_n$ be random scalars. (i) If
$a_n = O_{a.s.}(n^\lambda)$ and $b_n = O_{a.s.}(n^\mu)$, then $a_n b_n = O_{a.s.}(n^{\lambda+\mu})$ and $\{a_n + b_n\}$
is $O_{a.s.}(n^\kappa)$, $\kappa = \max[\lambda, \mu]$. (ii) If $a_n = o_{a.s.}(n^\lambda)$ and $b_n = o_{a.s.}(n^\mu)$, then
$a_n b_n = o_{a.s.}(n^{\lambda+\mu})$ and $a_n + b_n = o_{a.s.}(n^\kappa)$. (iii) If $a_n = O_{a.s.}(n^\lambda)$ and
$b_n = o_{a.s.}(n^\mu)$, then $a_n b_n = o_{a.s.}(n^{\lambda+\mu})$ and $\{a_n + b_n\}$ is $O_{a.s.}(n^\kappa)$.


2.3 Convergence in Probability 


A weaker stochastic convergence concept is that of convergence in proba- 
bility. 


Definition 2.23 Let $\{b_n\}$ be a sequence of real-valued random variables.
If there exists a real number $b$ such that for every $\varepsilon > 0$,
$P(\omega : |b_n(\omega) - b| < \varepsilon) \to 1$ as $n \to \infty$, then $b_n$ converges in probability to $b$,
written $b_n \xrightarrow{p} b$.


With almost sure convergence, the probability measure $P$ takes into account
the joint distribution of the entire sequence $\{Z_t\}$, but with convergence
in probability, we only need concern ourselves sequentially with the
joint distribution of the elements of $\{Z_t\}$ that actually appear in $b_n$, typically
the first $n$. When a sequence converges in probability, it becomes less
and less likely that an element of the sequence lies beyond any specified
distance $\varepsilon$ from $b$ as $n$ increases. The constant $b$ is called the probability
limit of $b_n$. A common notation is $\operatorname{plim} b_n = b$.

Convergence in probability is also referred to as weak consistency, and 
since this has been the most familiar stochastic convergence concept in 
econometrics, the word “weak” is often simply dropped. The relationship 
between convergence in probability and almost sure convergence is specified 
by the following result. 


Theorem 2.24 Let $\{b_n\}$ be a sequence of random variables. If $b_n \xrightarrow{a.s.} b$,
then $b_n \xrightarrow{p} b$. If $b_n \xrightarrow{p} b$, then there exists a subsequence $\{b_{n_j}\}$ such that
$b_{n_j} \xrightarrow{a.s.} b$.


Proof. See Lukacs (1975, p. 480). ■

Thus, almost sure convergence implies convergence in probability, but the 
converse does not hold. Nevertheless, a sequence that converges in probabil- 
ity always contains a subsequence that converges almost surely. Essentially, 
convergence in probability allows more erratic behavior in the converging
sequence than almost sure convergence, and by simply disregarding the er- 
ratic elements of the sequence we can obtain an almost surely convergent 
subsequence. For an example of a sequence that converges in probability 
but not almost surely, see Lukacs (1975, pp. 34-35). 
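
One standard construction of such a sequence is the so-called "typewriter" sequence on $\Omega = [0, 1)$ with the uniform distribution. It is sketched here as an illustration and is not necessarily the example given by Lukacs; the function names are hypothetical. Writing $n = 2^k + j$ with $0 \le j < 2^k$, let $b_n(\omega) = 1$ if $\omega \in [j/2^k, (j+1)/2^k)$ and $0$ otherwise. Then $P(b_n \ne 0) = 2^{-k} \to 0$, so $b_n \xrightarrow{p} 0$, yet every $\omega$ is covered once at each level $k$, so $b_n(\omega)$ equals 1 infinitely often and converges for no $\omega$.

```python
def typewriter(n, omega):
    """b_n(omega) for the 'typewriter' sequence on Omega = [0, 1):
    write n = 2**k + j with 0 <= j < 2**k and return 1 when omega lies in
    [j / 2**k, (j + 1) / 2**k), else 0."""
    k = n.bit_length() - 1      # level k with 2**k <= n < 2**(k + 1)
    j = n - 2 ** k
    return 1 if j / 2 ** k <= omega < (j + 1) / 2 ** k else 0

def prob_nonzero(n):
    """P(b_n = 1) under the uniform distribution: the interval length."""
    return 2.0 ** -(n.bit_length() - 1)

omega = 0.3                      # one fixed realization
levels = 12
hits = [n for n in range(1, 2 ** levels) if typewriter(n, omega) == 1]

print(prob_nonzero(2 ** levels - 1))   # probability of a nonzero value shrinks
print(len(hits))                       # yet this omega is hit once per level
```

Dropping all indices except the first one at each level, $n_k = 2^k$, gives a subsequence with $b_{n_k}(\omega) = 0$ for every $\omega > 0$ once $k \ge 1$ is large enough, which illustrates the almost surely convergent subsequence promised by Theorem 2.24.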


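Example 2.25 Let $\bar Z_n \equiv n^{-1} \sum_{t=1}^{n} Z_t$, where $\{Z_t\}$ is a sequence of random
variables such that $E(Z_t) = \mu$, $\operatorname{var}(Z_t) = \sigma^2 < \infty$ for all $t$ and
$\operatorname{cov}(Z_t, Z_\tau) = 0$ for $t \ne \tau$. Then $\bar Z_n \xrightarrow{p} \mu$ by the Chebyshev weak law of
large numbers (see Rao, 1973, p. 112).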
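
The mechanism behind Example 2.25 can be sketched in two lines. This is the standard variance argument, offered as an illustration rather than as a transcription of the proof in Rao (1973); it uses the Chebyshev inequality (Proposition 2.41 with $r = 2$) applied to $\bar Z_n - \mu$.

```latex
\begin{align*}
\operatorname{var}(\bar Z_n)
  &= n^{-2} \sum_{t=1}^{n} \sum_{\tau=1}^{n} \operatorname{cov}(Z_t, Z_\tau)
   = n^{-2} \sum_{t=1}^{n} \operatorname{var}(Z_t) = \sigma^2 / n, \\
P\left( |\bar Z_n - \mu| \ge \varepsilon \right)
  &\le \frac{E|\bar Z_n - \mu|^2}{\varepsilon^2}
   = \frac{\sigma^2}{n \varepsilon^2} \to 0 \quad \text{as } n \to \infty .
\end{align*}
```

The cross terms vanish because the $Z_t$ are uncorrelated, which is why independence is not needed here.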


Note that, in contrast to Example 2.10, the random variables here are not 
assumed either to be independent (simply uncorrelated) or identically dis- 
tributed (except for having identical mean and variance). However, second 
moments are restricted by the present result, whereas they are completely 
unrestricted in Example 2.10. 

Note also that, under the conditions of Example 2.10, convergence in 
probability follows immediately from the almost sure convergence. In gen- 
eral, most weak consistency results have strong consistency analogs that 
hold under identical or closely related conditions. For example, strong con- 
sistency also obtains under the conditions of Example 2.25. These analogs 
typically require somewhat more sophisticated techniques for their proof. 

Vectors and matrices are said to converge in probability provided each 
element converges in probability. 

To show that continuous functions of weakly consistent sequences con- 
verge to the functions evaluated at the probability limit, we use the follow- 
ing result. 


Proposition 2.26 (The implication rule) Consider events $E$ and $F_i$,
$i = 1, \ldots, k$, such that $(\bigcap_{i=1}^{k} F_i) \subset E$. Then $\sum_{i=1}^{k} P(F_i^c) \ge P(E^c)$.


Proof. See Lukacs (1975, p. 7). ■


Proposition 2.27 Given $g : \mathbb{R}^k \to \mathbb{R}^l$ and any sequence $\{b_n\}$ of $k \times 1$
random vectors such that $b_n \xrightarrow{p} b$, where $b$ is a $k \times 1$ vector, if $g$ is
continuous at $b$, then $g(b_n) \xrightarrow{p} g(b)$.


Proof. Let $g_j$ be an element of $g$. For every $\varepsilon > 0$, the continuity of
$g$ implies that there exists $\delta(\varepsilon) > 0$ such that if $|b_{ni}(\omega) - b_i| < \delta(\varepsilon)$,
$i = 1, \ldots, k$, then $|g_j(b_n(\omega)) - g_j(b)| < \varepsilon$. Define the events
$F_{ni} \equiv \{\omega : |b_{ni}(\omega) - b_i| < \delta(\varepsilon)\}$ and $E_n \equiv \{\omega : |g_j(b_n(\omega)) - g_j(b)| < \varepsilon\}$. Then
$(\bigcap_{i=1}^{k} F_{ni}) \subset E_n$. By the implication rule, $\sum_{i=1}^{k} P(F_{ni}^c) \ge P(E_n^c)$. Since
$b_n \xrightarrow{p} b$, for arbitrary $\eta > 0$ and all $n$ sufficiently large, $P(F_{ni}^c) \le \eta$.
Hence $P(E_n^c) \le k\eta$, or $P(E_n) \ge 1 - k\eta$. Since $P(E_n) \le 1$ and $\eta$ is arbitrary,
$P(E_n) \to 1$ as $n \to \infty$, hence $g_j(b_n) \xrightarrow{p} g_j(b)$. Since this holds for all
$j = 1, \ldots, l$, $g(b_n) \xrightarrow{p} g(b)$. ■

This result allows us to establish direct analogs of Theorem 2.12 and 
Exercise 2.13. 


Theorem 2.28 Suppose
(i) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $X'\varepsilon/n \xrightarrow{p} 0$;
(iii) $X'X/n \xrightarrow{p} M$, finite and positive definite.
Then $\hat\beta_n$ exists in probability, and $\hat\beta_n \xrightarrow{p} \beta_o$.


Proof. The proof is identical to that of Theorem 2.12 except that Proposition
2.27 is used instead of Proposition 2.11 and convergence in probability
replaces convergence almost surely. ■


The statement that $\hat\beta_n$ "exists in probability" is understood to mean that
there exists a subsequence $\{\hat\beta_{n_j}\}$ such that $\hat\beta_{n_j}$ exists for all $n_j$ sufficiently
large a.s., by Theorem 2.24. In other words, $X'X/n$ can converge to $M$
in such a way that $X'X/n$ does not have an inverse for each $n$, so that
$\hat\beta_n$ may fail to exist for particular values of $n$. However, a subsequence of
$\{X'X/n\}$ converges almost surely, and for that subsequence, $\hat\beta_{n_j}$ will exist
for all $n_j$ sufficiently large, almost surely.


Exercise 2.29 Prove the following result. Suppose
(i) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $Z'\varepsilon/n \xrightarrow{p} 0$;
(iii) (a) $Z'X/n \xrightarrow{p} Q$, finite with full column rank;
(b) $\hat P_n \xrightarrow{p} P$, finite, symmetric, and positive definite.
Then $\tilde\beta_n$ exists in probability, and $\tilde\beta_n \xrightarrow{p} \beta_o$.


Whether or not these results apply in particular situations depends on 
the nature of the data. As we mentioned before, for certain kinds of data it 
is restrictive to assume that $X'X/n$, $Z'X/n$, and $\hat P_n$ converge to constant
limits. We can relax this restriction by using an analog of Proposition 2.16. 
This result is also used heavily in later chapters. 


Proposition 2.30 Let $g : \mathbb{R}^k \to \mathbb{R}^l$ be continuous on a compact set
$C \subset \mathbb{R}^k$. Suppose that $\{b_n\}$ is a sequence of random $k \times 1$ vectors and $\{c_n\}$ is a
sequence of $k \times 1$ vectors such that $b_n - c_n \xrightarrow{p} 0$, and for all $n$ sufficiently
large, $c_n$ is interior to $C$, uniformly in $n$. Then $g(b_n) - g(c_n) \xrightarrow{p} 0$.


Proof. Let $g_j$ be an element of $g$. Since $C$ is compact, $g_j$ is uniformly
continuous by Theorem 2.15, so that for every $\varepsilon > 0$ there exists $\delta(\varepsilon) > 0$
such that if $|b_{ni} - c_{ni}| < \delta(\varepsilon)$, $i = 1, \ldots, k$, then $|g_j(b_n) - g_j(c_n)| < \varepsilon$.
Define the events $F_{ni} \equiv \{\omega : |b_{ni}(\omega) - c_{ni}| < \delta(\varepsilon)\}$ and
$E_n \equiv \{\omega : |g_j(b_n(\omega)) - g_j(c_n)| < \varepsilon\}$. Then $(\bigcap_{i=1}^{k} F_{ni}) \subset E_n$. By the implication
rule, $\sum_{i=1}^{k} P(F_{ni}^c) \ge P(E_n^c)$. Since $b_n - c_n \xrightarrow{p} 0$, for arbitrary $\eta > 0$
and all $n$ sufficiently large, $P(F_{ni}^c) \le \eta$. Hence $P(E_n^c) \le k\eta$, or
$P(E_n) \ge 1 - k\eta$. Since $P(E_n) \le 1$ and $\eta$ is arbitrary, $P(E_n) \to 1$ as $n \to \infty$, hence
$g_j(b_n) - g_j(c_n) \xrightarrow{p} 0$. As this holds for all $j = 1, \ldots, l$, $g(b_n) - g(c_n) \xrightarrow{p} 0$. ■


Theorem 2.31 Suppose
(i) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $X'\varepsilon/n \xrightarrow{p} 0$;
(iii) $X'X/n - M_n \xrightarrow{p} 0$, where $M_n = O(1)$ and is uniformly positive
definite.
Then $\hat\beta_n$ exists in probability, and $\hat\beta_n \xrightarrow{p} \beta_o$.


Proof. The proof is identical to that of Theorem 2.18 except that Proposition
2.30 is used instead of Proposition 2.16 and convergence in probability
replaces convergence almost surely. ■


Exercise 2.32 Prove the following result. Suppose
(i) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $Z'\varepsilon/n \xrightarrow{p} 0$;
(iii) (a) $Z'X/n - Q_n \xrightarrow{p} 0$, where $Q_n = O(1)$ and has uniformly full
column rank;
(b) $\hat P_n - P_n \xrightarrow{p} 0$, where $P_n = O(1)$ and is symmetric and uniformly
positive definite.
Then $\tilde\beta_n$ exists in probability, and $\tilde\beta_n \xrightarrow{p} \beta_o$.

As with convergence almost surely, the notion of orders of magnitude
extends directly to convergence in probability.


Definition 2.33 (i) The sequence $\{b_n\}$ is at most of order $n^\lambda$ in probability,
denoted $b_n = O_p(n^\lambda)$, if for every $\varepsilon > 0$ there exist a finite $\Delta_\varepsilon > 0$
and $N_\varepsilon \in \mathbb{N}$, such that $P\{\omega : |n^{-\lambda} b_n(\omega)| > \Delta_\varepsilon\} < \varepsilon$ for all $n > N_\varepsilon$.
(ii) The sequence $\{b_n\}$ is of order smaller than $n^\lambda$ in probability, denoted
$b_n = o_p(n^\lambda)$, if $n^{-\lambda} b_n \xrightarrow{p} 0$.


When $b_n = O_p(1)$, we say $\{b_n\}$ is bounded in probability and when
$b_n = o_p(1)$ we have $b_n \xrightarrow{p} 0$.


Example 2.34 Let $b_n = Z_n$, where $\{Z_t\}$ is a sequence of identically distributed
$N(0,1)$ random variables. Then $P(\omega : |b_n(\omega)| > \Delta) = P(|Z_n| > \Delta) = 2\Phi(-\Delta)$
for all $n \ge 1$, where $\Phi$ is the standard normal cumulative
distribution function (c.d.f.). By making $\Delta$ sufficiently large, we have
$2\Phi(-\Delta) < \delta$ for arbitrary $\delta > 0$. Hence, $b_n = Z_n = O_p(1)$.
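
How large $\Delta$ must be for a given $\delta$ can be computed directly. The sketch below is illustrative: the function names and the bisection bracket are choices made here, and it relies on the identity $2\Phi(-\Delta) = \operatorname{erfc}(\Delta/\sqrt{2})$ for the standard normal distribution.

```python
import math

def two_sided_tail(big_delta):
    """P(|Z| > big_delta) = 2 * Phi(-big_delta) for Z ~ N(0, 1)."""
    return math.erfc(big_delta / math.sqrt(2.0))

def bound_for(target, tol=1e-10):
    """Bisect for the Delta at which 2 * Phi(-Delta) equals target; the
    returned value has tail probability no larger than target."""
    lo, hi = 0.0, 40.0           # the tail is 1 at lo and numerically 0 at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if two_sided_tail(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi

for target in (0.1, 0.05, 0.001):
    print(target, bound_for(target))
```

Because the same bound works for every $n$, a single $\Delta$ per tolerance is all that the definition of $O_p(1)$ requires in this example.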


Note that $\Phi$ in this example can be replaced by any c.d.f. $F$ and the
result still holds, i.e., any random variable $Z$ with c.d.f. $F$ is $O_p(1)$.


Exercise 2.35 Prove the following. Let $a_n$ and $b_n$ be random scalars. (i)
If $a_n = O_p(n^\lambda)$ and $b_n = O_p(n^\mu)$, then $a_n b_n = O_p(n^{\lambda+\mu})$ and
$a_n + b_n = O_p(n^\kappa)$, $\kappa = \max(\lambda, \mu)$. (ii) If $a_n = o_p(n^\lambda)$ and $b_n = o_p(n^\mu)$, then
$a_n b_n = o_p(n^{\lambda+\mu})$ and $a_n + b_n = o_p(n^\kappa)$. (iii) If $a_n = O_p(n^\lambda)$ and $b_n = o_p(n^\mu)$, then
$a_n b_n = o_p(n^{\lambda+\mu})$ and $a_n + b_n = O_p(n^\kappa)$. (Hint: Apply Proposition 2.30.)


One of the most useful results in this chapter is the following corollary 
to this exercise, which is applied frequently in obtaining the asymptotic 
normality results of Chapter 4. 


Corollary 2.36 (Product rule) Let $A_n$ be $l \times k$ and let $b_n$ be $k \times 1$. If
$A_n = o_p(1)$ and $b_n = O_p(1)$, then $A_n b_n = o_p(1)$.


Proof. Let $a_n \equiv A_n b_n$ with $A_n = [A_{nij}]$. Then $a_{ni} = \sum_{j=1}^{k} A_{nij} b_{nj}$. As
$A_{nij} = o_p(1)$ and $b_{nj} = O_p(1)$, $A_{nij} b_{nj} = o_p(1)$ by Exercise 2.35 (iii).
Hence, $a_{ni} = o_p(1)$, since it is the sum of $k$ terms each of which is $o_p(1)$. It
follows that $a_n = A_n b_n = o_p(1)$. ■
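
A scalar Monte Carlo sketch can illustrate the product rule. The setup is an assumption chosen here for convenience: for i.i.d. $N(0,1)$ data, $\bar Z_n \sim N(0, 1/n)$ exactly, so $a_n = \bar Z_n$ is $o_p(1)$ and $b_n = \sqrt{n}\, \bar Z_n \sim N(0,1)$ is $O_p(1)$. The function name, the threshold 0.1, and the number of replications are hypothetical choices.

```python
import math
import random

def exceed_fraction(n, eps=0.1, reps=2_000, seed=0):
    """Monte Carlo estimate of P(|a_n * b_n| > eps), where
    a_n = Z_bar_n and b_n = sqrt(n) * Z_bar_n for i.i.d. N(0, 1) data,
    using the exact distribution Z_bar_n ~ N(0, 1/n)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        z_bar = rng.gauss(0.0, 1.0 / math.sqrt(n))  # one draw of Z_bar_n
        a_n = z_bar                    # o_p(1)
        b_n = math.sqrt(n) * z_bar     # O_p(1)
        if abs(a_n * b_n) > eps:
            count += 1
    return count / reps

for n in (100, 10_000, 1_000_000):
    print(n, exceed_fraction(n))
```

The estimated probability that the product exceeds the threshold falls toward zero as $n$ grows, as the corollary requires, even though $b_n$ itself does not shrink.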


2.4 Convergence in rth Mean 


The convergence notions of limits, almost sure limits, and probability limits 
are those most frequently encountered in econometrics, and most of the 
results in the literature are stated in these terms. Another convergence
concept often encountered in the context of time series data is that of 
convergence in the rth mean. 


Definition 2.37 Let $\{b_n\}$ be a sequence of real-valued random variables
such that for some $r > 0$, $E|b_n|^r < \infty$. If there exists a real number $b$ such
that $E(|b_n - b|^r) \to 0$ as $n \to \infty$, then $b_n$ converges in the $r$th mean to $b$,
written $b_n \xrightarrow{r.m.} b$.


The most commonly encountered situation is that in which $r = 2$, in
which case convergence is said to occur in quadratic mean, denoted $b_n \xrightarrow{q.m.} b$.
Alternatively, $b$ is said to be the limit in mean square of $b_n$, denoted
l.i.m. $b_n = b$.

A useful property of convergence in the rth mean is that it implies conver- 
gence in the sth mean for s < r. To prove this, we use Jensen’s inequality, 
which we now state. 


Proposition 2.38 (Jensen's inequality) Let $g : \mathbb{R} \to \mathbb{R}$ be a convex
function on an interval $B \subset \mathbb{R}$ and let $Z$ be a random variable such that
$P(Z \in B) = 1$. Then $g(E(Z)) \le E(g(Z))$. If $g$ instead is concave on $B$,
then $g(E(Z)) \ge E(g(Z))$.


Proof. See Rao (1973, pp. 57-58). ■


Example 2.39 Let $g(z) = |z|$. It follows from Jensen's inequality that
$|E(Z)| \le E(|Z|)$. Let $g(z) = z^2$. It follows from Jensen's inequality that
$(E(Z))^2 \le E(Z^2)$.


Theorem 2.40 If $b_n \xrightarrow{r.m.} b$ and $r > s$, then $b_n \xrightarrow{s.m.} b$.


Proof. Let $g(z) = z^q$, $q < 1$, $z \ge 0$. Then $g$ is concave. Set $z = |b_n - b|^r$
and $q = s/r$. From Jensen's inequality,

$$E(|b_n - b|^s) = E(\{|b_n - b|^r\}^q) \le \{E(|b_n - b|^r)\}^q.$$

Since $E(|b_n - b|^r) \to 0$ it follows that $E(\{|b_n - b|^r\}^q) = E(|b_n - b|^s) \to 0$
and hence $b_n \xrightarrow{s.m.} b$. ■


Convergence in the rth mean is a stronger convergence concept than 
convergence in probability, and in fact implies convergence in probability. 
To show this, we use the generalized Chebyshev inequality. 


Proposition 2.41 (Generalized Chebyshev inequality) Let $Z$ be a
random variable such that $E|Z|^r < \infty$, $r > 0$. Then for every $\varepsilon > 0$,


$$P(|Z| \ge \varepsilon) \le E(|Z|^r)/\varepsilon^r.$$


Proof. See Lukacs (1975, pp. 8-9). ■


When r = 1 we have Markov’s inequality and when r = 2 we have the 
familiar Chebyshev inequality. 
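
The inequality can be checked on data. The sketch below applies it to the empirical distribution of a simulated sample, for which it holds exactly, because $1\{|z| \ge \varepsilon\} \le |z|^r / \varepsilon^r$ for every individual observation. The normal sample, the values of $r$ and $\varepsilon$, and the helper name `chebyshev_sides` are assumptions chosen for illustration.

```python
import random

def chebyshev_sides(sample, eps, r):
    """Return (empirical P(|Z| >= eps), empirical E|Z|^r / eps^r)."""
    n = len(sample)
    freq = sum(1 for z in sample if abs(z) >= eps) / n
    bound = sum(abs(z) ** r for z in sample) / n / eps ** r
    return freq, bound

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

for r in (1, 2, 4):              # r = 1 is Markov, r = 2 is Chebyshev
    for eps in (0.5, 2.0, 3.0):
        freq, bound = chebyshev_sides(sample, eps, r)
        print(r, eps, freq, bound)
```

For small $\varepsilon$ the bound exceeds one and carries no information; for larger $\varepsilon$, higher moments give tighter bounds in this sample, which is one reason moment conditions of different orders appear in later results.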


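Theorem 2.42 If $b_n \xrightarrow{r.m.} b$ for some $r > 0$, then $b_n \xrightarrow{p} b$.


Proof. Since $E(|b_n - b|^r) \to 0$ as $n \to \infty$, $E(|b_n - b|^r) < \infty$ for all $n$
sufficiently large. It follows from the generalized Chebyshev inequality that,
for every $\varepsilon > 0$,


$$P(\omega : |b_n(\omega) - b| \ge \varepsilon) \le E(|b_n - b|^r)/\varepsilon^r.$$


Hence $P(\omega : |b_n(\omega) - b| < \varepsilon) \ge 1 - E(|b_n - b|^r)/\varepsilon^r \to 1$ as $n \to \infty$, since
$b_n \xrightarrow{r.m.} b$. It follows that $b_n \xrightarrow{p} b$. ■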

Without further conditions, no necessary relationship holds between con- 
vergence in the rth mean and almost sure convergence. For further discus- 
sion, see Lukacs (1975, Ch. 2). 

Since convergence in the rth mean will be used primarily in specify- 
ing conditions for later results rather than in stating their conclusions, we 
provide no analogs to the previous consistency results for the least squares 
or IV estimators. 
CHAPTER 3


Laws of Large Numbers 


In this chapter we study laws of large numbers, which provide conditions
guaranteeing the stochastic convergence (e.g., of $Z'X/n$ and
$Z'\varepsilon/n$) required for the consistency results of the previous
chapter. Since different conditions will apply to different kinds of
economic data (e.g., time series or cross section), we shall pay
particular attention to the kinds of data these conditions allow. Only
strong consistency results will be stated explicitly, since strong
consistency implies convergence in probability (by Theorem 2.24).

The laws of large numbers we consider are all of the following form.


Proposition 3.0 Given restrictions on the dependence, heterogeneity, and
moments of a sequence of random variables $\{Z_t\}$,
$\bar{Z}_n - \bar{\mu}_n \xrightarrow{a.s.} 0$, where
$\bar{Z}_n \equiv n^{-1} \sum_{t=1}^{n} Z_t$ and $\bar{\mu}_n \equiv E(\bar{Z}_n)$.


The results that follow specify precisely which restrictions on the
dependence, heterogeneity (i.e., the extent to which the distributions of
the $Z_t$ may differ across $t$), and moments are sufficient to allow the
conclusion $\bar{Z}_n - E(\bar{Z}_n) \xrightarrow{a.s.} 0$ to hold. As we
shall see, there are sometimes trade-offs among these restrictions; for
example, relaxing dependence or heterogeneity restrictions may require
strengthening moment restrictions.


3.1 Independent Identically Distributed
Observations 


The simplest case is that of independent identically distributed (i.i.d.) ran- 
dom variables. 


Theorem 3.1 (Kolmogorov) Let $\{Z_t\}$ be a sequence of i.i.d. random
variables. Then $\bar{Z}_n \xrightarrow{a.s.} \mu$ if and only if $E|Z_t| < \infty$ and $E(Z_t) = \mu$.


Proof. See Rao (1973, p. 115). ∎


An interesting feature of this result is that the condition given is
sufficient as well as necessary for $\bar{Z}_n \xrightarrow{a.s.} \mu$. Also
note that since $\{Z_t\}$ is i.i.d., $E(\bar{Z}_n) = \mu$.
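
A small simulation can illustrate the role of the moment condition. In the sketch below, the exponential and Cauchy distributions, the seed, and the sample size are illustrative assumptions of ours. The exponential draws satisfy $E|Z_t| < \infty$, so their running mean should settle near $\mu = 1$; the Cauchy draws have $E|Z_t| = \infty$, so Theorem 3.1 rules out almost sure convergence of their running mean to any constant.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 1_000_000
index = np.arange(1, n + 1)

# Exponential(1) draws: E|Z_t| = 1 is finite, so the sample mean should approach mu = 1.
z = rng.exponential(scale=1.0, size=n)
running_mean = np.cumsum(z) / index

# Standard Cauchy draws: E|Z_t| is infinite, so the sample mean has no almost sure limit.
c = rng.standard_cauchy(size=n)
cauchy_running_mean = np.cumsum(c) / index

for k in (100, 10_000, 1_000_000):
    print(k, running_mean[k - 1], cauchy_running_mean[k - 1])
```

A single path cannot prove or disprove almost sure convergence, but the contrast between the two columns is usually visible.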

To apply this result to econometric estimators we have to know that the
summands of $Z'X/n = n^{-1} \sum_{t=1}^{n} Z_t X_t'$ and
$Z'\varepsilon/n = n^{-1} \sum_{t=1}^{n} Z_t \varepsilon_t$ are i.i.d. This
occurs when the elements of $\{(Z_t', X_t', \varepsilon_t)\}$ are i.i.d. To
prove this, we use the following result.


Proposition 3.2 Let $g : \mathbb{R}^k \to \mathbb{R}^l$ be a continuous¹ function. (i) Let $Z_t$
and $Z_\tau$ be identically distributed. Then $g(Z_t)$ and $g(Z_\tau)$ are identically
distributed. (ii) Let $Z_t$ and $Z_\tau$ be independent. Then $g(Z_t)$ and $g(Z_\tau)$
are independent.


Proof. (i) Let $Y_t = g(Z_t)$, $Y_\tau = g(Z_\tau)$. Let $A = [z : g(z) \leq a]$. Then
$F_t(a) \equiv P[Y_t \leq a] = P[Z_t \in A] = P[Z_\tau \in A] = P[Y_\tau \leq a] \equiv F_\tau(a)$
for all $a \in \mathbb{R}^l$. Hence $g(Z_t)$ and $g(Z_\tau)$ are identically distributed. (ii) Let
$A_1 = [z : g(z) \leq a_1]$, $A_2 = [z : g(z) \leq a_2]$. Then
$F_{t\tau}(a_1, a_2) \equiv P[Y_t \leq a_1, Y_\tau \leq a_2] = P[Z_t \in A_1, Z_\tau \in A_2] = P[Z_t \in A_1] P[Z_\tau \in A_2] = P[Y_t \leq a_1] P[Y_\tau \leq a_2] = F_t(a_1) F_\tau(a_2)$
for all $a_1, a_2 \in \mathbb{R}^l$. Hence $g(Z_t)$ and $g(Z_\tau)$ are independent. ∎


Proposition 3.3 If $\{(Z_t', X_t', \varepsilon_t)\}$ is an i.i.d. random sequence, then $\{X_t X_t'\}$,
$\{X_t \varepsilon_t\}$, $\{Z_t X_t'\}$, $\{Z_t \varepsilon_t\}$, and $\{Z_t Z_t'\}$ are also i.i.d. sequences.


Proof. Immediate from Proposition 3.2 (i) and (ii). ∎


To write the moment conditions on the explanatory variables in compact 
form, we make use of the Cauchy-Schwarz inequality, which follows as a 
corollary to the following result. 


¹ This result also holds for measurable functions, defined in Definition 3.21.

Proposition 3.4 (Hölder's inequality) If $p > 1$ and $1/p + 1/q = 1$ and
if $E|Y|^p < \infty$ and $E|Z|^q < \infty$, then
$E|YZ| \leq [E|Y|^p]^{1/p} [E|Z|^q]^{1/q}$.


Proof. See Lukacs (1975, p. 11). ∎

If $p = q = 2$, we have the Cauchy-Schwarz inequality,


$$E|YZ| \leq [E|Y|^2]^{1/2} [E|Z|^2]^{1/2}.$$


The $i,j$th element of $X_t X_t'$ is given by $\sum_{h=1}^{p} X_{thi} X_{thj}$, and it follows
from the triangle inequality that


$$\left| \sum_{h=1}^{p} X_{thi} X_{thj} \right| \leq \sum_{h=1}^{p} |X_{thi} X_{thj}|.$$

Hence,


$$E\left| \sum_{h=1}^{p} X_{thi} X_{thj} \right| \leq \sum_{h=1}^{p} E|X_{thi} X_{thj}| \leq \sum_{h=1}^{p} (E|X_{thi}|^2)^{1/2} (E|X_{thj}|^2)^{1/2}$$


by the Cauchy-Schwarz inequality. It follows that the elements of $X_t X_t'$ will
have $E|\sum_{h=1}^{p} X_{thi} X_{thj}| < \infty$ (as we require to apply Kolmogorov's law of
large numbers), provided simply that $E|X_{thi}|^2 < \infty$ for all $h$ and $i$.
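
The chain of inequalities above holds exactly for sample moments as well, which gives a simple numerical check. In the sketch below the array `x[t, h, i]` stands in for $X_{thi}$; the Student-t(3) entries (finite second moments), the dimensions, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 5_000, 3, 2
# x[t, h, i] plays the role of X_{thi}; t(3) entries have finite second moments.
x = rng.standard_t(df=3, size=(n, p, k))

# Left side: sample analog of E| sum_h X_{thi} X_{thj} | for each pair (i, j).
prod_sum = np.einsum("thi,thj->tij", x, x)       # shape (n, k, k): X_t X_t' for each t
lhs = np.mean(np.abs(prod_sum), axis=0)

# Right side: sum_h (E X_{thi}^2)^{1/2} (E X_{thj}^2)^{1/2}, again with sample moments.
root_m2 = np.sqrt(np.mean(x ** 2, axis=0))       # shape (p, k)
rhs = np.einsum("hi,hj->ij", root_m2, root_m2)   # shape (k, k)

print(lhs)
print(rhs)
```

On the diagonal ($i = j$) the two sides coincide up to rounding, since both reduce to $\sum_h$ of the sample second moments; off the diagonal the bound is typically slack.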

Combining Theorems 3.1 and 2.12, we have the following OLS consis- 
tency result for i.i.d. observations. 


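Theorem 3.5 Suppose
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(X_t', \varepsilon_t)\}$ is an i.i.d. sequence;
(iii) (a) $E(X_t \varepsilon_t) = 0$;
      (b) $E|X_{thi} \varepsilon_{th}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$;
(iv) (a) $E|X_{thi}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$;
     (b) $M \equiv E(X_t X_t')$ is positive definite.
Then $\hat{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\hat{\beta}_n \xrightarrow{a.s.} \beta_o$.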


Proof. Given (ii), $\{X_t \varepsilon_t\}$ and $\{X_t X_t'\}$ are i.i.d. sequences. The elements
of $X_t \varepsilon_t$ and $X_t X_t'$ have finite expected absolute values, given (iii) and (iv)
and applying the Cauchy-Schwarz inequality as above. By Theorem 3.1,
$X'\varepsilon/n = n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t \xrightarrow{a.s.} 0$, and
$X'X/n = n^{-1} \sum_{t=1}^{n} X_t X_t' \xrightarrow{a.s.} M$, finite and
positive definite, so the conditions of Theorem 2.12 are satisfied and the
result follows. ∎


This result is useful in situations in which we have observations from
a random sample, as in a simple cross section. The result does not apply
to stratified cross sections since there the observations are not identically
distributed across strata, and generally will not apply to time-series data
since there the observations $(X_t', \varepsilon_t)$ generally are not independent. For these
situations, we need laws of large numbers that do not impose the i.i.d.
assumption. Since (i) is assumed, we could equally well have specified (ii) as
requiring that $\{(X_t', Y_t)\}$ is an i.i.d. sequence and then applied Proposition
3.2, which implies that $\{(X_t', \varepsilon_t)\}$ is an i.i.d. sequence. Next, note that
conditions sufficient to ensure $E(X_t \varepsilon_t) = 0$ would be $X_t$ independent of $\varepsilon_t$
for all $t$ and $E(\varepsilon_t) = 0$; alternatively, it would suffice that $E(\varepsilon_t | X_t) = 0$.
This latter condition follows if $E(Y_t | X_t) = X_t'\beta_o$ and we define $\varepsilon_t$ as
$\varepsilon_t \equiv Y_t - E(Y_t | X_t) = Y_t - X_t'\beta_o$. Both of these alternatives to (iii) are
stronger than the simple requirement that $E(X_t \varepsilon_t) = 0$.

In fact, by defining $\beta_o \equiv E(X_t X_t')^{-1} E(X_t Y_t)$ and $\varepsilon_t \equiv Y_t - X_t'\beta_o$, it
is guaranteed that $E(X_t \varepsilon_t) = 0$ (verify this). Thus, we are not making
strong assumptions about how $Y_t$ is generated. Note that no restrictions
are placed on the second moment of $\varepsilon_t$ in obtaining consistency for $\hat{\beta}_n$. In
fact, $\varepsilon_t$ can have infinite variance without affecting the consistency of $\hat{\beta}_n$
for $\beta_o$.
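
Both remarks can be illustrated with a short simulation. The sketch below assumes $p = 1$, a hypothetical coefficient vector `beta_o`, standard normal regressors, and Student-t errors with 1.5 degrees of freedom (mean zero, $E|\varepsilon_t| < \infty$, infinite variance); all of these are our illustrative choices. The sample moment `moment` is the sample analog of $E(X_t \varepsilon_t)$ and vanishes by construction of least squares, mirroring the population argument.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 200_000
beta_o = np.array([1.0, -0.5])            # hypothetical true coefficients

# Rows of x are X_t' (p = 1): an intercept and one standard normal regressor.
x = np.column_stack([np.ones(n), rng.standard_normal(n)])
eps = rng.standard_t(df=1.5, size=n)      # mean zero, finite E|eps|, infinite variance
y = x @ beta_o + eps

beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)   # OLS estimate
resid = y - x @ beta_hat
moment = x.T @ resid / n                  # sample analog of E(X_t eps_t)

print("beta_hat:", beta_hat)
print("sample moment:", moment)
```

With infinite-variance errors the estimate still tends to $\beta_o$, although convergence can be noticeably slower than in the finite-variance case.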


The result for the IV estimator is analogous. 


Exercise 3.6 Prove the following result. Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ an i.i.d. sequence;
(iii) (a) $E(Z_t \varepsilon_t) = 0$;
      (b) $E|Z_{thi} \varepsilon_{th}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$;
(iv) (a) $E|Z_{thi} X_{thj}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$;
     (b) $Q \equiv E(Z_t X_t')$ has full column rank;
     (c) $\hat{P}_n \xrightarrow{a.s.} P$, finite, symmetric, and positive definite.

Then $\tilde{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\tilde{\beta}_n \xrightarrow{a.s.} \beta_o$.


3.2 Independent Heterogeneously Distributed
Observations 


For cross-sectional data, it is often appropriate to assume that the obser- 
vations are independent but not identically distributed. The failure of the 
identical distribution assumption results from stratifying (grouping) the 
population in some way. The independence assumption remains valid pro- 
vided that sampling within and across the strata is random. A law of large 
numbers useful in these situations is the following. 


Theorem 3.7 (Markov) Let $\{Z_t\}$ be a sequence of independent random
variables, with finite means $\mu_t \equiv E(Z_t)$. If for some $\delta > 0$,
$\sum_{t=1}^{\infty} (E|Z_t - \mu_t|^{1+\delta})/t^{1+\delta} < \infty$, then
$\bar{Z}_n - \bar{\mu}_n \xrightarrow{a.s.} 0$.

Proof. See Chung (1974, pp. 125-126). ∎


In this result the random variables are allowed to be heterogeneous (i.e.,
not identically distributed), but the moments are restricted by the condition
that $\sum_{t=1}^{\infty} E|Z_t - \mu_t|^{1+\delta}/t^{1+\delta} < \infty$, known as Markov's condition. If $\delta = 1$,
we have a law of large numbers due to Kolmogorov (e.g., see Rao, 1973, p.
114). But Markov's condition allows us to choose $\delta$ arbitrarily small, thus
reducing the restrictions imposed on $Z_t$.

By making use of Jensen’s inequality and the following useful inequality, 
it is possible to state a corollary with a simpler moment condition. 


Proposition 3.8 (The $c_r$ inequality) Let $Y$ and $Z$ be random variables
with $E|Y|^r < \infty$ and $E|Z|^r < \infty$ for some $r > 0$. Then
$E|Y + Z|^r \leq c_r (E|Y|^r + E|Z|^r)$, where $c_r = 1$ if $r \leq 1$ and $c_r = 2^{r-1}$ if $r > 1$.


Proof. See Lukacs (1975, p. 13). ∎


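Corollary 3.9 Let $\{Z_t\}$ be a sequence of independent random variables
such that $E|Z_t|^{1+\delta} < A < \infty$ for some $\delta > 0$ and all $t$. Then
$\bar{Z}_n - \bar{\mu}_n \xrightarrow{a.s.} 0$.


Proof. By Proposition 3.8,

$$E|Z_t - \mu_t|^{1+\delta} \leq 2^{\delta} (E|Z_t|^{1+\delta} + |\mu_t|^{1+\delta}).$$

By assuming that $E|Z_t|^{1+\delta} < A$ and using Jensen's inequality,

$$|\mu_t| = |E(Z_t)| \leq E|Z_t| \leq (E|Z_t|^{1+\delta})^{1/(1+\delta)}.$$

It follows that for all $t$,

$$|\mu_t|^{1+\delta} < A.$$

Hence, for all $t$, $E|Z_t - \mu_t|^{1+\delta} < 2^{1+\delta} A$. Verifying the moment condition
of Theorem 3.7, we have

$$\sum_{t=1}^{\infty} E|Z_t - \mu_t|^{1+\delta}/t^{1+\delta} < 2^{1+\delta} A \sum_{t=1}^{\infty} 1/t^{1+\delta} < \infty,$$

since $\sum_{t=1}^{\infty} 1/t^{1+\delta} < \infty$ for any $\delta > 0$. Hence the conditions of Theorem
3.7 are satisfied and the result follows. ∎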


Compared with Theorem 3.1, this corollary imposes slightly more in 
the way of moment restrictions but allows the observations to be rather 
heterogeneous. 
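
The following sketch illustrates the corollary for a sequence that is independent but far from identically distributed. The particular means $\mu_t = \sin t$, the standard deviations $1 + 0.5\cos t$, the normal shocks, and the seed are illustrative assumptions; they keep $E|Z_t|^{1+\delta}$ uniformly bounded, as the corollary requires.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
t = np.arange(1, n + 1)

mu = np.sin(t)                             # means differ across t
sigma = 1.0 + 0.5 * np.cos(t)              # standard deviations differ across t, at most 1.5
z = mu + sigma * rng.standard_normal(n)    # independent, not identically distributed

# Running value of Z_bar_n - mu_bar_n along one sample path.
gap = np.cumsum(z - mu) / t
print(gap[[99, 9_999, n - 1]])
```

Note that $\bar{Z}_n$ itself need not converge to a fixed constant here; the statement concerns the difference $\bar{Z}_n - \bar{\mu}_n$.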

It is useful to point out that a nonstochastic sequence can be viewed
as a sequence of independent, not identically distributed random variables,
where the distribution function of these random variables places probability
one at the observed value. Hence, Corollary 3.9 can be applied to situations
in which we have fixed regressors, provided they are uniformly bounded,
as the condition $E|Z_t|^{1+\delta} < A < \infty$ requires. Situations with unbounded
fixed regressors can be treated using Theorem 3.7. To apply Corollary 3.9
to the linear model, we use the following fact.


Proposition 3.10 If $\{(Z_t', X_t', \varepsilon_t)\}$ is an independent sequence, then $\{X_t X_t'\}$,
$\{X_t \varepsilon_t\}$, $\{Z_t X_t'\}$, $\{Z_t \varepsilon_t\}$, and $\{Z_t Z_t'\}$ are also independent sequences.

Proof. Immediate from Proposition 3.2 (ii). ∎


To simplify the moment conditions that we impose, we use the following
consequence of Hölder's inequality.

Proposition 3.11 (Minkowski's inequality) Let $q \geq 1$. If $E|Y|^q < \infty$
and $E|Z|^q < \infty$, then $(E|Y + Z|^q)^{1/q} \leq (E|Y|^q)^{1/q} + (E|Z|^q)^{1/q}$.

Proof. See Lukacs (1975, p. 11). ∎

To apply Corollary 3.9 to $X_t X_t'$, we need to ensure that


$$E\left| \sum_{h=1}^{p} X_{thi} X_{thj} \right|^{1+\delta}$$


is bounded uniformly in $t$. This is accomplished by the following corollary.


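Corollary 3.12 Suppose $E|X_{thi}^2|^{1+\delta} < A < \infty$ for some $\delta > 0$, all $h =
1, \ldots, p$, $i = 1, \ldots, k$, and all $t$. Then each element of $X_t X_t'$ satisfies
$E|\sum_{h=1}^{p} X_{thi} X_{thj}|^{1+\delta} < A' < \infty$ for some $\delta > 0$, all $i, j = 1, \ldots, k$,
and all $t$, where $A' \equiv p^{1+\delta} A$.

Proof. By Minkowski's inequality,


$$E\left| \sum_{h=1}^{p} X_{thi} X_{thj} \right|^{1+\delta} \leq \left[ \sum_{h=1}^{p} \left( E|X_{thi} X_{thj}|^{1+\delta} \right)^{1/(1+\delta)} \right]^{1+\delta}.$$


By the Cauchy-Schwarz inequality,

$$E|X_{thi} X_{thj}|^{1+\delta} \leq [E|X_{thi}^2|^{1+\delta}]^{1/2} [E|X_{thj}^2|^{1+\delta}]^{1/2}.$$


Since $E|X_{thi}^2|^{1+\delta} < A < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$, it follows that for
all $h = 1, \ldots, p$ and $i, j = 1, \ldots, k$,


$$E|X_{thi} X_{thj}|^{1+\delta} < A^{1/2} A^{1/2} = A,$$


so that

$$E\left| \sum_{h=1}^{p} X_{thi} X_{thj} \right|^{1+\delta} < \left[ \sum_{h=1}^{p} A^{1/(1+\delta)} \right]^{1+\delta} = p^{1+\delta} A.$$

∎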

The requirement that $E|X_{thi}^2|^{1+\delta} < A < \infty$ means that all the explanatory
variables have moments slightly greater than 2 uniformly bounded. A
similar requirement is imposed on the elements of $X_t \varepsilon_t$.


Exercise 3.13 Show that if $E|X_{thi} \varepsilon_{th}|^{1+\delta} < A < \infty$ for some $\delta > 0$, all
$h = 1, \ldots, p$, $i = 1, \ldots, k$, and all $t$, then each element of $X_t \varepsilon_t$ satisfies
$E|\sum_{h=1}^{p} X_{thi} \varepsilon_{th}|^{1+\delta} < A' < \infty$ for some $\delta > 0$, all $i = 1, \ldots, k$, and all $t$,
where $A' \equiv p^{1+\delta} A$.


We now have all the results needed to obtain a consistency theorem for 
the ordinary least squares estimator. Since the argument is analogous to 
that of Theorem 3.5 we state the result as an exercise. 


Exercise 3.14 Prove the following result. Suppose
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(X_t', \varepsilon_t)\}$ is an independent sequence;
(iii) (a) $E(X_t \varepsilon_t) = 0$, $t = 1, 2, \ldots$;
      (b) $E|X_{thi} \varepsilon_{th}|^{1+\delta} < A < \infty$ for some $\delta > 0$, all $h = 1, \ldots, p$,
          $i = 1, \ldots, k$, and all $t$;
(iv) (a) $E|X_{thi}^2|^{1+\delta} < A < \infty$ for some $\delta > 0$, all $h = 1, \ldots, p$,
         $i = 1, \ldots, k$, and all $t$;
     (b) $M_n \equiv E(X'X/n)$ is uniformly positive definite.
Then $\hat{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\hat{\beta}_n \xrightarrow{a.s.} \beta_o$.


Compared with Theorem 3.5, we have relaxed the identical distribution 
assumption at the expense of imposing slightly greater moment restrictions 
in (iii.b) and (iv.a). Also note that (iv.a) implies that $M_n = O(1)$. (Why?)

The extra generality we have gained now allows treatment of situations 
with fixed regressors, or observations from a stratified cross section, and 
also applies to models with (unconditionally) heteroskedastic errors. None 
of these cases is covered by Theorem 3.5. 

The result for the IV estimator is analogous. 


Theorem 3.15 Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is an independent sequence;
(iii) (a) $E(Z_t \varepsilon_t) = 0$, $t = 1, 2, \ldots$;
      (b) $E|Z_{thi} \varepsilon_{th}|^{1+\delta} < A < \infty$, for some $\delta > 0$, all $h = 1, \ldots, p$,
          $i = 1, \ldots, l$, and all $t$;
(iv) (a) $E|Z_{thi} X_{thj}|^{1+\delta} < A < \infty$, for some $\delta > 0$, all $h = 1, \ldots, p$,
         $i = 1, \ldots, l$, $j = 1, \ldots, k$, and all $t$;
     (b) $Q_n \equiv E(Z'X/n)$ has uniformly full column rank;
     (c) $\hat{P}_n - P_n \xrightarrow{a.s.} 0$, where $P_n = O(1)$ and is symmetric and uniformly
         positive definite.

Then $\tilde{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\tilde{\beta}_n \xrightarrow{a.s.} \beta_o$.


Proof. By Proposition 3.10, $\{Z_t \varepsilon_t\}$ and $\{Z_t X_t'\}$ are independent sequences
with elements satisfying the moment condition of Corollary 3.9, given (iii.b)
and (iv.a), by arguments analogous to those of Corollary 3.12 and Exercise
3.13. It follows from Corollary 3.9 that $Z'\varepsilon/n \xrightarrow{a.s.} 0$ and $Z'X/n - Q_n \xrightarrow{a.s.} 0$,
where $Q_n = O(1)$ given (iv.a) as a consequence of Jensen's inequality.
Hence, the conditions of Exercise 2.20 are satisfied and the results follow. ∎


3.3 Dependent Identically Distributed
Observations 


The assumption of independence is inappropriate for economic time series, 
which typically exhibit considerable dependence. To cover these cases, we 
need laws of large numbers that allow the random variables to be depen- 
dent. To speak precisely about the kinds of dependence allowed, we need 
to make explicit some fundamental notions of probability theory that we 
have so far used implicitly. 


Definition 3.16 A family (collection) $\mathcal{F}$ of subsets of a set $\Omega$ is a σ-field
(σ-algebra) provided

(i) $\emptyset$ and $\Omega$ belong to $\mathcal{F}$;
(ii) if $F$ belongs to $\mathcal{F}$, then $F^c$ (the complement of $F$ in $\Omega$) belongs to $\mathcal{F}$;
(iii) if $\{F_i\}$ is a sequence of sets in $\mathcal{F}$, then $\bigcup_{i=1}^{\infty} F_i$ belongs to $\mathcal{F}$.


For example, let $\mathcal{F} = \{\emptyset, \Omega\}$. Then $\mathcal{F}$ is easily verified to be a σ-field (try
it!). Or let $F$ be a subset of $\Omega$ and put $\mathcal{F} = \{\emptyset, \Omega, F, F^c\}$. Again $\mathcal{F}$ is easily
verified to be a σ-field. We consider further examples below.
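
For a finite $\Omega$ the three conditions of Definition 3.16 can be checked mechanically. The sketch below is only illustrative: the function name `is_sigma_field` and the four-point set `omega` are our assumptions. On a finite set, closure under pairwise unions already implies closure under countable unions, which is what the last line of the function relies on.

```python
from itertools import combinations

def is_sigma_field(omega, family):
    """Check Definition 3.16 for a family of subsets of a finite set omega."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    # (i) the empty set and omega must belong to the family
    if frozenset() not in family or omega not in family:
        return False
    # (ii) closure under complements
    if any(omega - f not in family for f in family):
        return False
    # (iii) closure under unions (pairwise suffices because the family is finite)
    return all(a | b in family for a, b in combinations(family, 2))

omega = {1, 2, 3, 4}
f = {1, 2}
trivial = [set(), omega]                      # the family {empty set, omega}
four_sets = [set(), omega, f, omega - f]      # the family {empty set, omega, F, F^c}
not_closed = [set(), omega, {1}, {2}, {2, 3, 4}, {1, 3, 4}]   # lacks the union {1, 2}

print(is_sigma_field(omega, trivial),
      is_sigma_field(omega, four_sets),
      is_sigma_field(omega, not_closed))
```

The third family is closed under complements but not under unions, so it fails condition (iii).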

The pair $(\Omega, \mathcal{F})$ is called a measurable space when $\mathcal{F}$ is a σ-field of $\Omega$.

The sets in a σ-field $\mathcal{F}$ are sets for which it is possible to assign well-defined
probabilities. Thus, we can think of the sets in $\mathcal{F}$ as events. We are
now in a position to give a formal definition of the concept of a probability
or, more formally, a probability measure.


Definition 3.17 Let $(\Omega, \mathcal{F})$ be a measurable space. A mapping $P : \mathcal{F} \to
[0, 1]$ is a probability measure on $(\Omega, \mathcal{F})$ provided that:

(i) $P(\emptyset) = 0$.
(ii) For any $F \in \mathcal{F}$, $P(F^c) = 1 - P(F)$.
(iii) For any disjoint sequence $\{F_i\}$ of sets in $\mathcal{F}$ (i.e., $F_i \cap F_j = \emptyset$ for all
$i \neq j$), $P(\bigcup_{i=1}^{\infty} F_i) = \sum_{i=1}^{\infty} P(F_i)$.


When $(\Omega, \mathcal{F})$ is a measurable space and $P$ is a probability measure on
$(\Omega, \mathcal{F})$, we call the triple $(\Omega, \mathcal{F}, P)$ a probability space. When the underlying
measurable space is clear, we just call $P$ a probability measure. Thus,
a probability measure $P$ assigns a number between zero and one to every
event ($F \in \mathcal{F}$) in a way that coincides with our intuitive notion of how
probabilities should behave. This powerful way of understanding probabilities
is one of Kolmogorov's many important contributions (Kolmogorov
1933).

Now that we have formally defined probability measures, let us return
our attention to the various collections of events (σ-fields) that are relevant
for econometrics.

Recall that an open set is a set containing only interior points, where
a point $x$ in the set $B$ is interior provided that all points in a sufficiently
small neighborhood of $x$ ($\{y : |y - x| < \varepsilon\}$ for some $\varepsilon > 0$) are also in $B$.
Thus $(a, b)$ is open, while $(a, b]$ is not.


Definition 3.18 The Borel σ-field $\mathcal{B}$ is the smallest collection of sets
(called the Borel sets) that includes

(i) all open sets of $\mathbb{R}$;
(ii) the complement $B^c$ of any set $B$ in $\mathcal{B}$;
(iii) the union $\bigcup_{i=1}^{\infty} B_i$ of any sequence $\{B_i\}$ of sets in $\mathcal{B}$.


The Borel sets of $\mathbb{R}$ just defined are said to be generated by the open sets
of $\mathbb{R}$. The same Borel sets would be generated by all the open half-lines of
$\mathbb{R}$, all the closed half-lines of $\mathbb{R}$, all the open intervals of $\mathbb{R}$, or all the closed
intervals of $\mathbb{R}$. The Borel sets are a "rich" collection of events for which
probabilities can be defined. Nevertheless, there do exist subsets of the real
line not in $\mathcal{B}$ for which probabilities are not defined; constructing such sets
is very complicated (see Billingsley, 1979, p. 37).

Thus, we can think of the Borel σ-field as consisting of all the events on
the real line to which we can assign a probability. Sets not in $\mathcal{B}$ will not
define events.

The Borel σ-field just defined relates to real-valued random variables. A
simple extension covers vector-valued random variables.


Definition 3.19 The Borel σ-field $\mathcal{B}^q$, $q < \infty$, is the smallest collection
of sets that includes

(i) all open sets of $\mathbb{R}^q$;
(ii) the complement $B^c$ of any set $B$ in $\mathcal{B}^q$;
(iii) the union $\bigcup_{i=1}^{\infty} B_i$ of any sequence $\{B_i\}$ in $\mathcal{B}^q$.


In our notation, $\mathcal{B}$ and $\mathcal{B}^1$ mean the same thing.

Generally, we are interested in infinite sequences $\{(Z_t', X_t', \varepsilon_t)\}$. If $p = 1$,
this is a sequence of random $1 \times (l + k + 1)$ vectors, whereas if $p > 1$, this
is a sequence of $p \times (l + k + 1)$ matrices. Nevertheless, we can convert these
matrices into vectors by simply stacking the columns of a matrix, one on top
of the other, to yield a $p(l + k + 1) \times 1$ vector, denoted $\mathrm{vec}((Z_t', X_t', \varepsilon_t))$. (In
what follows, we drop the vec operator and understand that it is implicit
in this context.) Generally, then, we are interested in infinite sequences
of $q$-dimensional random vectors, where $q = p(l + k + 1)$. Corresponding
to these are the Borel sets of $\mathbb{R}^q_\infty$, defined as the Cartesian products of a
countable infinity of copies of $\mathbb{R}^q$, $\mathbb{R}^q_\infty \equiv \mathbb{R}^q \times \mathbb{R}^q \times \cdots$. In what follows we
can think of $\omega$ taking its values in $\Omega = \mathbb{R}^q_\infty$. The events in which we are
interested are the Borel sets of $\mathbb{R}^q_\infty$, which we define as follows.


Definition 3.20 The Borel sets of $\mathbb{R}^q_\infty$, denoted $\mathcal{B}^q_\infty$, are the smallest
collection of sets that includes

(i) all sets of the form $\times_{i=1}^{\infty} B_i$, where each $B_i \in \mathcal{B}^q$ and $B_i = \mathbb{R}^q$ except
for finitely many $i$;
(ii) the complement $F^c$ of any set $F$ in $\mathcal{B}^q_\infty$;
(iii) the union $\bigcup_{i=1}^{\infty} F_i$ of any sequence $\{F_i\}$ in $\mathcal{B}^q_\infty$.


A set of the form specified by (i) is called a measurable finite-dimensional
product cylinder, so $\mathcal{B}^q_\infty$ is the Borel σ-field generated by all the measurable
finite-dimensional product cylinders. When $(\mathbb{R}^q_\infty, \mathcal{B}^q_\infty)$ is the specific
measurable space, a probability measure $P$ on $(\mathbb{R}^q_\infty, \mathcal{B}^q_\infty)$ will govern the
behavior of events involving infinite sequences of finite dimensional vectors,
just as we require. In particular, when $q = 1$, the elements $Z_t(\cdot)$ of
the sequence $\{Z_t\}$ can be thought of as functions from $\Omega = \mathbb{R}^1_\infty$ to the real
line $\mathbb{R}$ that simply pick off the $t$th coordinate of $\omega \in \Omega$; with $\omega = \{z_t\}$,
$Z_t(\omega) = z_t$. When $q > 1$, $Z_t(\cdot)$ maps $\Omega = \mathbb{R}^q_\infty$ into $\mathbb{R}^q$.

The following definition plays a key role in our analysis. 


Definition 3.21 A function $g$ on $\Omega$ to $\mathbb{R}$ is $\mathcal{F}$-measurable if for every real
number $a$ the set $[\omega : g(\omega) \leq a] \in \mathcal{F}$.


Example 3.22 Let $(\Omega, \mathcal{F}) = (\mathbb{R}^q_\infty, \mathcal{B}^q_\infty)$ and set $q = 1$. Then $Z_t(\cdot)$ as just
defined is $\mathcal{B}^q_\infty$-measurable because
$[\omega : Z_t(\omega) \leq a] = [z_1, \ldots, z_{t-1}, z_t, z_{t+1}, \ldots : z_1 < \infty, \ldots, z_{t-1} < \infty, z_t \leq a, z_{t+1} < \infty, \ldots] \in \mathcal{B}^q_\infty$
for any $a \in \mathbb{R}$.


When a function is $\mathcal{F}$-measurable, it means that we can express the
probability of an event, say, $[Z_t \leq a]$, in terms of the probability of an
event in $\mathcal{F}$, say, $[\omega : Z_t(\omega) \leq a]$. In fact, a random variable is precisely an
$\mathcal{F}$-measurable function from $\Omega$ to $\mathbb{R}$.

In Definition 3.21, when the σ-field is taken to be $\mathcal{B}^q_\infty$, the Borel sets of
$\mathbb{R}^q_\infty$, we shall drop explicit reference to $\mathcal{B}^q_\infty$ and simply say that the function
$g$ is measurable. Otherwise, the relevant σ-field will be explicitly identified.


Proposition 3.23 Let $f$ and $g$ be $\mathcal{F}$-measurable real-valued functions, and
let $c$ be a real number. Then the functions $cf$, $f + g$, $fg$, and $|f|$ are also
$\mathcal{F}$-measurable.

Proof. See Bartle (1966, Lemma 2.6). ∎


Example 3.24 If $Z_t$ is measurable, then $Z_t/n$ is measurable, so that
$\bar{Z}_n = \sum_{t=1}^{n} Z_t/n$ is measurable.


A function from $\Omega$ to $\mathbb{R}^q$ is measurable if and only if each component
of the vector valued function is measurable. The notion of measurability
extends to transformations from $\Omega$ to $\Omega$ in the following way.


Definition 3.25 Let $(\Omega, \mathcal{F})$ be a measurable space. A one-to-one
transformation² $T : \Omega \to \Omega$ is measurable provided that $T^{-1}(\mathcal{F}) \subset \mathcal{F}$.


In other words, the transformation T is measurable provided that any set 
taken by the transformation (or its inverse) into F is itself a set in F. This 
ensures that sets that are not events cannot be transformed into events, 
nor can events be transformed into sets that are not events. 


Example 3.26 For any $\omega = (\ldots, z_{t-2}, z_{t-1}, z_t, z_{t+1}, z_{t+2}, \ldots)$ let
$\omega' = T\omega = (\ldots, z_{t-1}, z_t, z_{t+1}, z_{t+2}, z_{t+3}, \ldots)$, so that $T$ transforms $\omega$ by shifting
each of its coordinates back one location. Then $T$ is measurable since $T(F)$
is in $\mathcal{F}$ and $T^{-1}(F)$ is in $\mathcal{F}$, for all $F \in \mathcal{F}$.


The transformation of this example is often called the shift, or the
backshift operator. By using such transformations, it is possible to define
a corresponding transformation of a random variable. For example,
set $Z_1(\omega) = Z(\omega)$, where $Z$ is a measurable function from $\Omega$ to $\mathbb{R}$; then
we can define the random variables $Z_2(\omega) = Z(T\omega)$, $Z_3(\omega) = Z(T^2\omega)$,
and so on, provided that $T$ is a measurable transformation. The random
variables constructed in this way are said to be random variables induced
by a measurable transformation.
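
The construction can be mimicked in a few lines of code. In the sketch below a sample point $\omega$ is represented as a function from the index $t$ to the coordinate $z_t$; the names `shift`, `z`, and `induced`, and the particular point with $z_t = t^2$, are hypothetical choices made only for illustration.

```python
def shift(omega):
    """Backshift T: the t-th coordinate of T(omega) is the (t+1)-th coordinate of omega."""
    return lambda t: omega(t + 1)

def z(omega):
    """The random variable Z picks off the coordinate at t = 1."""
    return omega(1)

def induced(omega, t):
    """Induced random variable Z_t(omega) = Z(T^(t-1) omega)."""
    for _ in range(t - 1):
        omega = shift(omega)
    return z(omega)

omega = lambda t: t ** 2        # hypothetical sample point with coordinates z_t = t^2
print([induced(omega, t) for t in (1, 2, 3, 4)])
```

Applying the shift $t - 1$ times and then reading the first coordinate recovers the $t$th coordinate of the original point, which is exactly how the induced sequence reproduces $\{Z_t\}$.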


Definition 3.27 Let $(\Omega, \mathcal{F}, P)$ be a probability space. The transformation
$T : \Omega \to \Omega$ is measure preserving if it is measurable and if $P(T^{-1}F) =
P(F)$ for all $F$ in $\mathcal{F}$.


The random variables induced by measure-preserving transformations
then have the property that
$P[Z_1 \leq a] = P[\omega : Z(\omega) \leq a] = P[\omega : Z(T\omega) \leq a] = P[Z_2 \leq a]$;
that is, they are identically distributed. In fact,
such random variables have an even stronger property. We use the following
definition.


² The transformation $T$ maps an element of $\Omega$, say $\omega$, into another element of $\Omega$, say
$\omega' = T\omega$. When $T$ operates on a set $F$, it should be understood as operating on each
element of $F$. Similarly, when $T$ operates on a family $\mathcal{F}$, it should be understood as
operating on each set in the family.


Definition 3.28 Let $G_1$ be the joint distribution function of the sequence
$\{Z_1, Z_2, \ldots\}$, where $Z_t$ is a $q \times 1$ vector, and let $G_{\tau+1}$ be the joint
distribution function of the sequence $\{Z_{\tau+1}, Z_{\tau+2}, \ldots\}$. The sequence $\{Z_t\}$
is stationary if $G_1 = G_{\tau+1}$ for each $\tau \geq 1$.


In other words, a sequence is stationary if the joint distribution of the 
variables in the sequence is identical, regardless of the date of the first 
observation. 


Proposition 3.29 Let $Z$ be a random variable (i.e., $Z$ is a measurable
function) and $T$ be a measure-preserving transformation. Let $Z_1(\omega) =
Z(\omega)$, $Z_2(\omega) = Z(T\omega)$, ..., $Z_n(\omega) = Z(T^{n-1}\omega)$ for each $\omega$ in $\Omega$. Then
$\{Z_t\}$ is a stationary sequence.


Proof. See Stout (1974, p. 169). ∎


A converse to this result is also available. 


Proposition 3.30 Let $\{Z_t\}$ be a stationary sequence. Then there exists a
measure-preserving transformation $T$ defined on $(\Omega, \mathcal{F}, P)$ such that
$Z_1(\omega) = Z_1(\omega)$, $Z_2(\omega) = Z_1(T\omega)$, $Z_3(\omega) = Z_1(T^2\omega)$, ..., $Z_n(\omega) = Z_1(T^{n-1}\omega)$
for all $\omega$ in $\Omega$.


Proof. See Stout (1974, p. 170). ∎


Example 3.31 Let $\{Z_t\}$ be a sequence of i.i.d. $N(0, 1)$ random variables.
Then $\{Z_t\}$ is stationary.


The independence imposed in this example is crucial. If the elements of
$\{Z_t\}$ are simply identically distributed $N(0, 1)$, the sequence is not
necessarily stationary, because it is possible to construct different joint
distributions that all have normal marginal distributions. By changing the
joint distributions with $t$, we could violate the stationarity condition while
preserving marginal normality. Thus stationarity is a strengthening of the
identical distribution assumption, since it applies to joint and not simply
marginal distributions. On the other hand, stationarity is weaker than the
i.i.d. assumption, since i.i.d. sequences are stationary, but stationary
sequences do not have to be independent.

Does a version of the law of large numbers, Theorem 3.1, hold if the i.i.d.
assumption is simply replaced by the stationarity assumption? The answer
is no, unless additional restrictions are imposed.

Example 3.32 Let $\{U_t\}$ be a sequence of i.i.d. random variables uniformly
distributed on $[0, 1]$ and let $Z$ be $N(0, 1)$, independent of $U_t$, $t = 1, 2, \ldots$.
Define $Y_t \equiv Z + U_t$. Then $\{Y_t\}$ is stationary (why?), but $\bar{Y}_n = \sum_{t=1}^{n} Y_t/n$
does not converge to $E(Y_t) = \frac{1}{2}$. Instead, $\bar{Y}_n - Z \xrightarrow{a.s.} \frac{1}{2}$.


In this example, $\bar{Y}_n$ converges to a random variable, $Z + \frac{1}{2}$, rather than
to a constant. The problem is that there is too much dependence in the
sequence $\{Y_t\}$. No matter how far into the future we take an observation
on $Y_t$, the initial value $Y_1$ still determines to some extent what $Y_t$ will be,
as a result of the common component $Z$. In fact, the correlation between
$Y_1$ and $Y_t$ is always positive for any value of $t$.
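
A simulation makes the failure of the law of large numbers in this example concrete. The sketch below is illustrative only; the function name `sample_mean_path`, the seed, and the sample size are our assumptions. Each path draws the common component $Z$ once and then averages $Z + U_t$ over a long sample.

```python
import numpy as np

def sample_mean_path(rng, n):
    """Draw Z once, then return (Z, Y_bar_n) for Y_t = Z + U_t with U_t i.i.d. U[0, 1]."""
    z = rng.standard_normal()              # common component shared by every Y_t
    u = rng.uniform(0.0, 1.0, size=n)
    return z, np.mean(z + u)

rng = np.random.default_rng(3)
for _ in range(4):
    z, y_bar = sample_mean_path(rng, 200_000)
    print(f"Z = {z:+.4f}   Y_bar = {y_bar:+.4f}   Y_bar - Z = {y_bar - z:.4f}")
```

Across paths the sample means differ because $Z$ differs, while the last column stays close to $\frac{1}{2}$ on every path.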

To obtain a law of large numbers, we have to impose a restriction on 
the dependence or “memory” of the sequence. One such restriction is the 
concept of ergodicity. 


Definition 3.33 Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\{Z_t\}$ be a
stationary sequence and let $T$ be the measure-preserving transformation of
Proposition 3.30. Then $\{Z_t\}$ is ergodic if


$$\lim_{n \to \infty} n^{-1} \sum_{t=1}^{n} P(F \cap T^t G) = P(F) P(G)$$


for all events $F, G \in \mathcal{F}$.


If $F$ and $G$ were independent, then we would have $P(F \cap G) = P(F) P(G)$.
We can think of $T^t G$ as being the event $G$ shifted $t$ periods into the future,
and since $P(T^t G) = P(G)$ when $T$ is measure preserving, this definition
says that an ergodic process (sequence) is one such that for any events $F$
and $G$, $F$ and $T^t G$ are independent on average in the limit. Thus ergodicity
can be thought of as a form of "average asymptotic independence." For
more on measure-preserving transformations, stationarity, and ergodicity
the reader may consult Doob (1953, pp. 167-185) and Rosenblatt (1978).

The desired law of large numbers can now be stated. 


Theorem 3.34 (Ergodic theorem) Let $\{Z_t\}$ be a stationary ergodic scalar
sequence with $E|Z_t| < \infty$. Then $\bar{Z}_n \xrightarrow{a.s.} \mu \equiv E(Z_t)$.


Proof. See Stout (1974, p. 181). ∎

To apply this result, we make use of the following theorem.

Theorem 3.35 Let $g$ be an $\mathcal{F}$-measurable function into $\mathbb{R}^k$ and define
$Y_t \equiv g(\ldots, Z_{t-1}, Z_t, Z_{t+1}, \ldots)$, where $Z_t$ is $q \times 1$. (i) If $\{Z_t\}$ is stationary,
then $\{Y_t\}$ is stationary. (ii) If $\{Z_t\}$ is stationary and ergodic, then
$\{Y_t\}$ is stationary and ergodic.

Proof. See Stout (1974, pp. 170, 182). ∎


Note that $g$ depends on the present and infinite past and future of the
sequence $\{Z_t\}$. As stated by Stout, $g$ only depends on the present and
future of $Z_t$, but the result is valid as given here.


Proposition 3.36 If $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence, then
$\{X_t X_t'\}$, $\{X_t \varepsilon_t\}$, $\{Z_t X_t'\}$, $\{Z_t \varepsilon_t\}$, and $\{Z_t Z_t'\}$ are stationary ergodic
sequences.


Proof. Immediate from Theorem 3.35 and Proposition 3.23. ∎


Now we can state a result applicable to time-series data. 


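Theorem 3.37 Suppose
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;
(iii) (a) $E(X_t \varepsilon_t) = 0$;
      (b) $E|X_{thi} \varepsilon_{th}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$;
(iv) (a) $E|X_{thi}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$;
     (b) $M \equiv E(X_t X_t')$ is positive definite.


Then $\hat{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\hat{\beta}_n \xrightarrow{a.s.} \beta_o$.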


Proof. We verify the conditions of Theorem 2.12. Given (ii), $\{X_t \varepsilon_t\}$ and
$\{X_t X_t'\}$ are stationary ergodic sequences by Proposition 3.36, with elements
having finite expected absolute values (given (iii) and (iv)). By the
ergodic theorem (Theorem 3.34), $X'\varepsilon/n \xrightarrow{a.s.} 0$ and $X'X/n \xrightarrow{a.s.} M$, finite
and positive definite. Hence, the conditions of Theorem 2.12 hold and the
results follow. ∎

Compared with Theorem 3.5, we have replaced the i.i.d. assumption with 
the strictly weaker condition that the regressors and errors are stationary 
and ergodic. In both results, only the finiteness of second-order moments 
and cross moments is imposed. Thus Theorem 3.5 is a corollary of Theorem 
3.37. 
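
A first-order autoregression gives a simple time-series illustration of Theorem 3.37. The sketch below assumes a Gaussian AR(1) process $Y_t = \rho_o Y_{t-1} + \varepsilon_t$ with $|\rho_o| < 1$, started from its stationary distribution, so that the regressor $X_t = Y_{t-1}$ and the error $\varepsilon_t$ form a stationary ergodic sequence with $E(X_t \varepsilon_t) = 0$. The function name `simulate_ar1`, the value `rho_o = 0.5`, the seed, and the sample size are illustrative assumptions.

```python
import numpy as np

def simulate_ar1(rho, n, rng):
    """Stationary Gaussian AR(1): Y_t = rho * Y_{t-1} + eps_t with eps_t i.i.d. N(0, 1)."""
    y = np.empty(n + 1)
    y[0] = rng.standard_normal() / np.sqrt(1.0 - rho ** 2)   # draw from the stationary law
    eps = rng.standard_normal(n)
    for t in range(1, n + 1):
        y[t] = rho * y[t - 1] + eps[t - 1]
    return y

rng = np.random.default_rng(11)
rho_o = 0.5                                # hypothetical true coefficient
y = simulate_ar1(rho_o, 200_000, rng)

x_lag, y_now = y[:-1], y[1:]               # regressor X_t = Y_{t-1}, dependent variable Y_t
rho_hat = (x_lag @ y_now) / (x_lag @ x_lag)   # OLS without an intercept
print(rho_hat)
```

The observations here are clearly dependent, so Theorem 3.5 does not apply, yet the least squares estimate still settles near $\rho_o$.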

A direct generalization of Exercise 3.6 for the IV estimator is also avail- 
able. 


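Exercise 3.38 Prove the following result. Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;
(iii) (a) $E(Z_t \varepsilon_t) = 0$;
      (b) $E|Z_{thi} \varepsilon_{th}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$;
(iv) (a) $E|Z_{thi} X_{thj}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$;
     (b) $Q \equiv E(Z_t X_t')$ has full column rank;
     (c) $\hat{P}_n \xrightarrow{a.s.} P$, finite, symmetric, and positive definite.


Then $\tilde{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\tilde{\beta}_n \xrightarrow{a.s.} \beta_o$.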


Economic applications of Theorem 3.37 and Exercise 3.38 depend on
whether it is reasonable to suppose that economic time series are stationary
and ergodic. Ergodicity is often difficult to ascertain theoretically (although
it does hold for certain Markov sequences; see Stout, 1974, pp. 185-200)
and is impossible to verify empirically (since this requires an infinite
sample), although it can be tested and rejected (see, for example, Domowitz
and El-Gamal, 1993; and Corradi, Swanson, and White, 2000). Further,
many important economic time series seem not to be stationary but
heterogeneous, exhibiting means, variances, and covariances that change over
time.


3.4 Dependent Heterogeneously Distributed 
Observations 


To apply the consistency results of the preceding chapter to dependent 
heterogeneously distributed observations, we need to find conditions that 
ensure that the law of large numbers continues to hold. This can be done 
by replacing the ergodicity assumption with somewhat stronger conditions. 
Useful in this context are conditions on the dependence of a sequence known 
as mixing conditions.

To specify these conditions, we use the following definition. 


Definition 3.39 The Borel σ-field generated by $\{Z_t, t = n, \ldots, n + m\}$,
denoted $\mathcal{B}_n^{n+m} = \sigma(Z_n, \ldots, Z_{n+m})$, is the smallest σ-algebra of $\Omega$ that
includes

(i) all sets of the form $\times_{i=1}^{n-1} \mathbb{R}^q \times_{i=n}^{n+m} B_i \times_{i=n+m+1}^{\infty} \mathbb{R}^q$, where each
$B_i \in \mathcal{B}^q$;
(ii) the complement $A^c$ of any set $A$ in $\mathcal{B}_n^{n+m}$;
(iii) the union $\bigcup_{i=1}^{\infty} A_i$ of any sequence $\{A_i\}$ in $\mathcal{B}_n^{n+m}$.


The σ-field $\mathcal{B}_n^{n+m}$ is the smallest σ-field of subsets of $\Omega$ with respect to
which $Z_t$, $t = n, \ldots, n + m$ are measurable. In other words, $\mathcal{B}_n^{n+m}$ is the
smallest collection of events that allows us to express the probability of an
event, say, $[Z_n < a_1, Z_{n+1} < a_2]$, in terms of the probability of an event
in $\mathcal{B}_n^{n+m}$, say $[\omega : Z_n(\omega) < a_1, Z_{n+1}(\omega) < a_2]$. The definition of mixing is
given in terms of the Borel σ-fields generated by subsets of the history of a
process extending infinitely far into both the past and future, $\{Z_t\}_{t=-\infty}^{\infty}$.
For our purposes, we can think of $Z_1$ as generating the first observation
available to us, so realizations of $Z_t$ are unobservable for $t \leq 0$. In what
follows, this does not matter. All that does matter is the behavior of $Z_t$,
$t \leq 0$, if we could observe its realizations.


Definition 3.40 Let $\mathcal{B}_{-\infty}^{n} \equiv \sigma(\ldots, Z_n)$ be the smallest collection of
subsets of $\Omega$ that contains the union of the σ-fields $\mathcal{B}_a^n$ as $a \to -\infty$; let
$\mathcal{B}_{n+m}^{\infty} \equiv \sigma(Z_{n+m}, \ldots)$ be the smallest collection of subsets of $\Omega$ that
contains the union of the σ-fields $\mathcal{B}_{n+m}^{a}$ as $a \to \infty$.


Intuitively, we can think of $\mathcal{B}_{-\infty}^{n}$ as representing all the information
contained in the past of the sequence $\{Z_t\}$ up to time $n$, whereas $\mathcal{B}_{n+m}^{\infty}$
represents all the information contained in the future of the sequence $\{Z_t\}$
from time $n + m$ on.


We now define two measures of dependence between o-fields. 


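Definition 3.41 Let $\mathcal{G}$ and $\mathcal{H}$ be σ-fields and define


$$\phi(\mathcal{G}, \mathcal{H}) \equiv \sup_{\{G \in \mathcal{G}, H \in \mathcal{H} : P(G) > 0\}} |P(H | G) - P(H)|,$$

$$\alpha(\mathcal{G}, \mathcal{H}) \equiv \sup_{\{G \in \mathcal{G}, H \in \mathcal{H}\}} |P(G \cap H) - P(G) P(H)|.$$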


Intuitively, $\phi$ and $\alpha$ measure the dependence of the events in $\mathcal{H}$ on those
in $\mathcal{G}$ in terms of how much the probability of the joint occurrence of an
event in each $\sigma$-algebra differs from the product of the probabilities of
each event occurring. The events in $\mathcal{G}$ and $\mathcal{H}$ are independent if and only if
$\phi(\mathcal{G}, \mathcal{H})$ and $\alpha(\mathcal{G}, \mathcal{H})$ are zero. The function $\alpha$ provides an absolute measure
of dependence and $\phi$ a relative measure of dependence.
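
As a small illustration of the two measures, the following sketch computes them exactly for a case where every event can be enumerated. It assumes a stationary two-state Markov chain (our choice, not an example from the text) and compares only the $\sigma$-fields generated by the two single observations $X_0$ and $X_m$, not the full past and future $\sigma$-fields. The function names are hypothetical.

```python
from itertools import combinations


def mat_mul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def joint_distribution(trans, pi, m):
    """Return P(X_0 = i, X_m = j) for a two-state chain started from pi."""
    power = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(m):
        power = mat_mul(power, trans)
    return [[pi[i] * power[i][j] for j in range(2)] for i in range(2)]


def dependence_measures(joint):
    """Compute (phi, alpha) between sigma(X_0) and sigma(X_m) by enumeration."""
    states = (0, 1)
    # Every event in sigma(X_0) or sigma(X_m) corresponds to a subset of states.
    events = [set(c) for r in range(3) for c in combinations(states, r)]
    phi = alpha = 0.0
    for g in events:
        p_g = sum(joint[i][j] for i in g for j in states)
        for h in events:
            p_h = sum(joint[i][j] for i in states for j in h)
            p_gh = sum(joint[i][j] for i in g for j in h)
            alpha = max(alpha, abs(p_gh - p_g * p_h))
            if p_g > 0:  # phi conditions only on events of positive probability
                phi = max(phi, abs(p_gh / p_g - p_h))
    return phi, alpha


if __name__ == "__main__":
    trans = [[0.9, 0.1], [0.1, 0.9]]   # assumed transition matrix
    pi = [0.5, 0.5]                    # its stationary distribution
    for m in (1, 5, 20):
        print(m, dependence_measures(joint_distribution(trans, pi, m)))
```

For this symmetric chain both measures shrink geometrically in the separation $m$, and the relative measure is never smaller than the absolute one.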


Definition 3.42 For a sequence of random vectors $\{Z_t\}$, with $\mathcal{B}_{-\infty}^{n}$ and
$\mathcal{B}_{n+m}^{\infty}$ as in Definition 3.40, define the mixing coefficients

    $\phi(m) \equiv \sup_n \phi(\mathcal{B}_{-\infty}^{n}, \mathcal{B}_{n+m}^{\infty})$ and $\alpha(m) \equiv \sup_n \alpha(\mathcal{B}_{-\infty}^{n}, \mathcal{B}_{n+m}^{\infty})$.

If, for the sequence $\{Z_t\}$, $\phi(m) \to 0$ as $m \to \infty$, $\{Z_t\}$ is called $\phi$-mixing.
If, for the sequence $\{Z_t\}$, $\alpha(m) \to 0$ as $m \to \infty$, $\{Z_t\}$ is called $\alpha$-mixing.

The quantities $\phi(m)$ and $\alpha(m)$ measure how much dependence exists
between events separated by at least $m$ time periods. Hence, if $\phi(m) = 0$
or $\alpha(m) = 0$ for some $m$, events $m$ periods apart are independent. By
allowing $\phi(m)$ or $\alpha(m)$ to approach zero as $m \to \infty$, we allow consideration
of situations in which events are independent asymptotically. In the
probability literature, $\phi$-mixing sequences are also called uniform mixing (see
Iosifescu and Theodorescu, 1969), whereas $\alpha$-mixing sequences are called
strong mixing (see Rosenblatt, 1956). Because $\phi(m) \ge \alpha(m)$, $\phi$-mixing
implies $\alpha$-mixing.


Example 3.43 (i) Let $\{Z_t\}$ be a $\gamma$-dependent sequence (i.e., $Z_t$ is
independent of $Z_{t-\tau}$ for all $\tau > \gamma$). Then $\phi(m) = \alpha(m) = 0$ for all $m > \gamma$. (ii)
Let $\{Z_t\}$ be a nonstochastic sequence. Then it is an independent sequence,
so $\phi(m) = \alpha(m) = 0$ for all $m > 0$. (iii) Let $Z_t = \rho_o Z_{t-1} + \varepsilon_t$, $t = 1, \ldots, n$,
where $|\rho_o| < 1$ and $\varepsilon_t$ is i.i.d. $N(0, 1)$. (This is called the Gaussian AR(1)
process.) Then $\alpha(m) \to 0$ as $m \to \infty$, although $\phi(m) \not\to 0$ as $m \to \infty$
(Ibragimov and Linnik, 1971, pp. 312-313).


The concept of mixing has a meaningful physical interpretation. Adapt- 
ing an example due to Halmos (1956), we imagine a dry martini initially 
poured so that 99% is gin and 1% is vermouth (placed in a layer at the 
top). The martini is steadily stirred by a swizzle stick; t increments with 
each stir. We observe the proportions of gin and vermouth in any measur- 
able set (i.e., volume of martini). If these proportions tend to 99% and 1% 
after many stirs, regardless of which volume we observe, then the process is 
mixing. In this example, the stochastic process corresponds to the position 
of a given particle at each point in time, which can be represented as a 
sequence of three-dimensional vectors $\{Z_t\}$.

The notion of mixing is a stronger memory requirement than that of 
ergodicity for stationary sequences, since given stationarity, mixing implies 
ergodicity, as the next result makes precise. 


Proposition 3.44 Let $\{Z_t\}$ be a stationary sequence. If $\alpha(m) \to 0$ as
$m \to \infty$, then $\{Z_t\}$ is ergodic.


Proof. See Rosenblatt (1978). ∎


Note that if $\phi(m) \to 0$ as $m \to \infty$, then $\alpha(m) \to 0$ as $m \to \infty$, so that
$\phi$-mixing processes are also ergodic. Ergodic processes are not necessarily
mixing, however. On the other hand, mixing is defined for sequences that
are not necessarily strictly stationary, so mixing is more general in this
sense. For more on mixing and ergodicity, see Rosenblatt (1972, 1978).

To state the law of large numbers for mixing sequences we use the
following definition.


Definition 3.45 Let $a \in \mathbb{R}$. (i) If $\phi(m) = O(m^{-a-\varepsilon})$ for some $\varepsilon > 0$, then
$\phi$ is of size $-a$. (ii) If $\alpha(m) = O(m^{-a-\varepsilon})$ for some $\varepsilon > 0$, then $\alpha$ is of size
$-a$.


This definition allows precise statements about the memory of a random 
sequence that we shall relate to moment conditions expressed in terms of a. 
As a gets smaller, the sequence exhibits more and more dependence, while 
as a — oo, the sequence exhibits less dependence. 


Example 3.46 (i) Let $\{Z_t\}$ be independent $Z_t \sim N(0, \sigma_t^2)$. Then $\{Z_t\}$ has
$\phi$ of size $-1$. (This is not the smallest size that could be invoked.) (ii) Let
$Z_t$ be a Gaussian AR(1) process. It can be shown that $\{Z_t\}$ has $\alpha$ of size
$-a$ for any $a \in \mathbb{R}$, since $\alpha(m)$ decreases exponentially with $m$.
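
To see why exponential decay gives every size, the following minimal sketch assumes, purely for illustration, a coefficient bounded by $\rho^m$ with $0 < \rho < 1$. The coefficient is $O(m^{-a-\varepsilon})$ exactly when $m^{a+\varepsilon}\rho^m$ stays bounded in $m$, so the code locates the maximum of that product. The function name and the argument `power` (standing for $a + \varepsilon$) are hypothetical.

```python
def scaled_coefficients(rho, power, m_max):
    """Return (m, m**power * rho**m) maximizing the product over m = 1..m_max."""
    best_m, best_value = 1, rho
    for m in range(1, m_max + 1):
        value = m ** power * rho ** m
        if value > best_value:
            best_m, best_value = m, value
    return best_m, best_value


if __name__ == "__main__":
    # The maximizer does not move when the search range grows, so the
    # product is bounded and rho**m = O(m**(-power)).
    print(scaled_coefficients(0.5, 3, 50))
    print(scaled_coefficients(0.5, 3, 5000))
```

The maximum sits near $m = -(a+\varepsilon)/\ln\rho$ whatever the value of `power`, which is why any size $-a$ is attainable under geometric decay.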


The result of this example extends to many finite autoregressive mov- 
ing average (ARMA) processes. Under general conditions, finite ARMA 
processes have exponentially decaying memories. 

Using the concept of mixing, we can state a law of large numbers, due 
to McLeish (1975), which applies to heterogeneous dependent sequences. 


Theorem 3.47 (McLeish) Let $\{Z_t\}$ be a sequence of scalars with finite
means $\mu_t \equiv E(Z_t)$ and suppose that
$\sum_{t=1}^{\infty} (E|Z_t - \mu_t|^{r+\delta} / t^{r+\delta})^{1/r} < \infty$ for
some $\delta$, $0 < \delta \le r$, where $r \ge 1$. If $\phi$ is of size $-r/(2r - 1)$ or $\alpha$ is of size
$-r/(r - 1)$, $r > 1$, then $\bar{Z}_n - \bar{\mu}_n \xrightarrow{a.s.} 0$.


Proof. See McLeish (1975, Theorem 2.10). ∎

This result generalizes the Markov law of large numbers, Theorem 3.7. 
(There we have r = 1.) 

Using an argument analogous to that used in obtaining Corollary 3.9, we 
obtain the following corollary. 


Corollary 3.48 Let $\{Z_t\}$ be a sequence with $\phi$ of size $-r/(2r - 1)$, $r \ge 1$,
or $\alpha$ of size $-r/(r - 1)$, $r > 1$, such that $E|Z_t|^{r+\delta} < \Delta < \infty$ for some
$\delta > 0$ and all $t$. Then $\bar{Z}_n - \bar{\mu}_n \xrightarrow{a.s.} 0$.


Setting r arbitrarily close to unity yields a generalization of Corollary 
3.9 that would apply to sequences with exponential memory decay. For 
sequences with longer memories, r is greater, and the moment restrictions 
increase accordingly. Here we have a clear trade-off between the amount of 
allowable dependence and the sufficient moment restrictions. 
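
The conclusion of Corollary 3.48 can be illustrated by simulation. The sketch below assumes a Gaussian AR(1) process with coefficient 0.5 (a case with exponentially decaying memory, as in Example 3.46) and reports the sample mean for increasing sample sizes. The function name, the seed, and the zero start-up value are illustrative choices, and a simulation suggests the result without proving it.

```python
import random


def ar1_sample_mean(rho, n, seed=0):
    """Simulate Z_t = rho * Z_{t-1} + eps_t, eps_t ~ N(0, 1); return the mean."""
    rng = random.Random(seed)
    z = 0.0      # start-up value; its influence dies out geometrically
    total = 0.0
    for _ in range(n):
        z = rho * z + rng.gauss(0.0, 1.0)
        total += z
    return total / n


if __name__ == "__main__":
    for n in (100, 10_000, 200_000):
        print(n, ar1_sample_mean(0.5, n))
```

Here every $\mu_t$ is zero apart from the start-up effect, so the printed sample means should settle near zero as $n$ grows.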

To apply this result, we use the following theorem.

Theorem 3.49 Let $g$ be a measurable function into $\mathbb{R}^k$ and define
$Y_t \equiv g(Z_t, Z_{t+1}, \ldots, Z_{t+\tau})$, where $\tau$ is finite. If the sequence of $q \times 1$ vectors
$\{Z_t\}$ is $\phi$-mixing ($\alpha$-mixing) of size $-a$, $a > 0$, then $\{Y_t\}$ is $\phi$-mixing
($\alpha$-mixing) of size $-a$.


Proof. See White and Domowitz (1984, Lemma 2.1). ∎


In other words, measurable functions of mixing processes are mixing and
of the same size. Note that whereas functions of ergodic processes retain
ergodicity for any $\tau$, finite or infinite, mixing is guaranteed only for finite
$\tau$.


Proposition 3.50 If $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence of size $-a$, then
$\{X_t X_t'\}$, $\{X_t \varepsilon_t\}$, $\{Z_t X_t'\}$, $\{Z_t \varepsilon_t\}$, and $\{Z_t Z_t'\}$ are mixing sequences of size
$-a$.

Proof. Immediate from Theorem 3.49 and Proposition 3.23. ∎

Now we can generalize the results of Exercise 3.14 to allow for dependence
as well as heterogeneity.


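Exercise 3.51 Prove the following result. Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(X_t', \varepsilon_t)\}$ is a mixing sequence with $\phi$ of size $-r/(2r - 1)$, $r \ge 1$, or
$\alpha$ of size $-r/(r - 1)$, $r > 1$;

(iii) (a) $E(X_t \varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|X_{thi}\varepsilon_{th}|^{r+\delta} < \Delta < \infty$, for some $\delta > 0$, $h = 1, \ldots, p$,
$i = 1, \ldots, k$ and for all $t$;

(iv) (a) $E|X_{thi}^2|^{r+\delta} < \Delta < \infty$, for some $\delta > 0$, $h = 1, \ldots, p$, $i = 1, \ldots, k$
and for all $t$;
(b) $M_n \equiv E(X'X/n)$ is uniformly positive definite.


Then $\hat{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\hat{\beta}_n \xrightarrow{a.s.} \beta_o$.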


From this result, we can obtain the result of Exercise 3.14 as a direct 
corollary by setting r = 1. Compared to our first consistency result, The- 
orem 3.5, we have relaxed the independence and identical distribution as- 
sumptions, but strengthened the moment requirements somewhat. Among 
the many different possibilities that this result allows, we can have lagged 
dependent variables and nonstochastic variables both appearing in the
explanatory variables $X_t$. The regression errors $\varepsilon_t$ may be heteroskedastic or
may be serially correlated.

In fact, Exercise 3.51 is a powerful result that can be applied to a wide
range of situations faced by economists. For further discussion of linear 
models with mixing observations, see Domowitz (1983). 
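
A simulation can make the scope of this consistency result concrete. The following sketch assumes a scalar model with a lagged dependent variable as the only regressor and independent errors whose variance alternates over time; this design is our own illustration, and we take for granted rather than verify that it meets the mixing and moment conditions. The function name, coefficient value, and error scales are hypothetical.

```python
import random


def ols_lagged_dependent(beta, n, seed=0):
    """OLS slope from regressing Y_t on Y_{t-1} when Y_t = beta*Y_{t-1} + eps_t."""
    rng = random.Random(seed)
    y_prev = 0.0
    sxy = sxx = 0.0
    for t in range(n):
        scale = 1.0 if t % 2 == 0 else 2.0   # heteroskedastic error scale
        y = beta * y_prev + scale * rng.gauss(0.0, 1.0)
        sxy += y_prev * y                    # accumulates sum of X_t * Y_t
        sxx += y_prev * y_prev               # accumulates sum of X_t ** 2
        y_prev = y
    return sxy / sxx


if __name__ == "__main__":
    for n in (200, 20_000, 100_000):
        print(n, ols_lagged_dependent(0.5, n))
```

The estimates should approach the assumed coefficient 0.5 as $n$ grows, even though the regressor is stochastic and the errors are not identically distributed.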

Applications of Exercise 3.51 often use the following result, which allows 
the interchange of expectation and infinite sums. 


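Proposition 3.52 Let $\{Z_t\}$ be a sequence of random variables such that
$\sum_{t=1}^{\infty} E|Z_t| < \infty$. Then $\sum_{t=1}^{\infty} Z_t$ converges a.s. and

    $E\left(\sum_{t=1}^{\infty} Z_t\right) = \sum_{t=1}^{\infty} E(Z_t) < \infty$.


Proof. See Billingsley (1979, p. 181). ∎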


This result is useful in verifying the conditions of Exercise 3.51 for the 
following exercise. 


Exercise 3.53 (i) State conditions that are sufficient to ensure the
consistency of the OLS estimator when $Y_t = \alpha_o Y_{t-1} + \beta_o X_t + \varepsilon_t$, where $Y_t$,
$X_t$, and $\varepsilon_t$ are scalars. (Hint: The Minkowski inequality applies to infinite
sums; that is, given $\{Z_t\}$ such that $\sum_{t=1}^{\infty} (E|Z_t|^p)^{1/p} < \infty$ with $p \ge 1$, then
$E|\sum_{t=1}^{\infty} Z_t|^p \le (\sum_{t=1}^{\infty} (E|Z_t|^p)^{1/p})^p$.) (ii) Find a simple example to which
Exercise 3.51 does not apply.


Conditions for the consistency of the IV estimator are given by the next 
result. 


Theorem 3.54 Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with $\phi$ of size $-r/(2r - 1)$, $r \ge 1$,
or $\alpha$ of size $-r/(r - 1)$, $r > 1$;

(iii) (a) $E(Z_t \varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^{r+\delta} < \Delta < \infty$, for some $\delta > 0$, all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, and all $t$;

(iv) (a) $E|Z_{thi}X_{thj}|^{r+\delta} < \Delta < \infty$, for some $\delta > 0$, all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, $j = 1, \ldots, k$ and all $t$;
(b) $Q_n \equiv E(Z'X/n)$ has uniformly full column rank;
(c) $\hat{P}_n - P_n \xrightarrow{a.s.} 0$, where $P_n = O(1)$ and is symmetric and uniformly
positive definite.


Then $\tilde{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\tilde{\beta}_n \xrightarrow{a.s.} \beta_o$.

Proof. By Proposition 3.50, $\{Z_t \varepsilon_t\}$ and $\{Z_t X_t'\}$ are mixing sequences
with elements satisfying the conditions of Corollary 3.48 (given (iii.b) and
(iv.a)). It follows from Corollary 3.48 that $Z'\varepsilon/n \xrightarrow{a.s.} 0$ and
$Z'X/n - Q_n \xrightarrow{a.s.} 0$, where $Q_n = O(1)$, given (iv.a), as a consequence of Jensen's
inequality. Hence the conditions of Exercise 2.20 are satisfied and the results
follow. ∎


Although mixing is an appealing dependence concept, it shares with 
ergodicity the property that it can be somewhat difficult to verify theo- 
retically and is impossible to verify empirically. An alternative dependence 
concept that is easier to verify theoretically is a form of asymptotic non- 
correlation. 


Definition 3.55 The scalar sequence $\{Z_t\}$ has asymptotically uncorrelated
elements (or is asymptotically uncorrelated) if there exist constants
$\{\rho_\tau, \tau \ge 0\}$ such that $0 \le \rho_\tau \le 1$, $\sum_{\tau=0}^{\infty} \rho_\tau < \infty$ and
$\mathrm{cov}(Z_t, Z_{t+\tau}) \le \rho_\tau (\mathrm{var}(Z_t)\,\mathrm{var}(Z_{t+\tau}))^{1/2}$ for all $\tau > 0$, where $\mathrm{var}(Z_t) < \infty$ for all $t$.


Note that $\rho_\tau$ is only an upper bound on the correlation between $Z_t$ and
$Z_{t+\tau}$ and that the actual correlation may depend on $t$. Further, only
positive correlation matters, so that if $Z_t$ and $Z_{t+\tau}$ are negatively correlated,
we can set $\rho_\tau = 0$. Also note that for $\sum_{\tau=0}^{\infty} \rho_\tau < \infty$, it is necessary that
$\rho_\tau \to 0$ as $\tau \to \infty$, and it is sufficient that for all $\tau$ sufficiently large,
$\rho_\tau < \tau^{-1-\delta}$ for some $\delta > 0$.


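Example 3.56 Let $Z_t = \rho_o Z_{t-1} + \varepsilon_t$, where $\varepsilon_t$ is i.i.d., $E(\varepsilon_t) = 0$,
$\mathrm{var}(\varepsilon_t) = \sigma_o^2$, $E(Z_{t-1}\varepsilon_t) = 0$. Then $\mathrm{corr}(Z_t, Z_{t+\tau}) = \rho_o^\tau$. If $0 \le \rho_o < 1$,
$\sum_{\tau=0}^{\infty} \rho_o^\tau = 1/(1 - \rho_o) < \infty$, so the sequence $\{Z_t\}$ is asymptotically uncorrelated.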
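
The summability condition in this example can be checked numerically. The sketch below assumes a coefficient of 0.5 and standard normal errors (illustrative choices), compares a partial sum of the geometric series with its limit, and estimates autocorrelations from one simulated path. All function names are hypothetical, and the sample autocorrelations are only estimates of the population values.

```python
import random


def geometric_partial_sum(rho, terms):
    """Sum rho**tau for tau = 0, ..., terms - 1."""
    return sum(rho ** tau for tau in range(terms))


def simulate_ar1(rho, n, seed=0):
    """Return a path of Z_t = rho * Z_{t-1} + eps_t with eps_t ~ N(0, 1)."""
    rng = random.Random(seed)
    z, path = 0.0, []
    for _ in range(n):
        z = rho * z + rng.gauss(0.0, 1.0)
        path.append(z)
    return path


def sample_autocorrelation(z, lag):
    """Sample correlation between z[t] and z[t + lag]."""
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z)
    cov = sum((z[t] - mean) * (z[t + lag] - mean) for t in range(n - lag))
    return cov / var


if __name__ == "__main__":
    rho = 0.5
    print(geometric_partial_sum(rho, 60), 1.0 / (1.0 - rho))
    path = simulate_ar1(rho, 50_000)
    for lag in (1, 2, 3):
        print(lag, sample_autocorrelation(path, lag), rho ** lag)
```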


If a sequence has constant finite variance and has covariances that
depend only on the time lag between $Z_t$ and $Z_{t+\tau}$, the sequence is said to
be covariance stationary. (This is implied by stationarity but is weaker
because a sequence can be covariance stationary without being stationary.)
Verifying that a covariance stationary sequence has asymptotically
uncorrelated elements is straightforward when the process has a finite ARMA
representation (see Granger and Newbold, 1977, Ch. 1). In this case, $\rho_\tau$
can be determined from well-known formulas (see, e.g., Granger and
Newbold, 1977, Ch. 1) and the condition $\sum_{\tau=0}^{\infty} \rho_\tau < \infty$ can be directly
evaluated. Thus, covariance stationary sequences as well as stationary ergodic
sequences can often be shown to be asymptotically uncorrelated, although
an asymptotically uncorrelated sequence need not be stationary and
ergodic or covariance stationary. Under general conditions on the size of $\phi$
or $\alpha$, mixing processes can be shown to be asymptotically uncorrelated.
Asymptotically uncorrelated sequences need not be mixing, however.

A law of large numbers for asymptotically uncorrelated sequences is the 
following. 


Theorem 3.57 Let $\{Z_t\}$ be a scalar sequence with asymptotically
uncorrelated elements with means $\mu_t \equiv E(Z_t)$ and $\sigma_t^2 \equiv \mathrm{var}(Z_t) < \Delta < \infty$. Then
$\bar{Z}_n - \bar{\mu}_n \xrightarrow{a.s.} 0$.


Proof. Immediate from Stout (1974, Theorem 3.7.2). ∎


Compared with Corollary 3.48, we have relaxed the dependence restric- 
tion from asymptotic independence (mixing) to asymptotic uncorrelation, 
but we have altered the moment requirements from restrictions on
moments of order $r + \delta$ ($r \ge 1$, $\delta > 0$) to second moments. Typically, this is a
strengthening of the moment restrictions. 

Since taking functions of random variables alters their correlation proper- 
ties, there is no simple analog of Proposition 3.2, Theorem 3.35, or Theorem 
3.49. To obtain consistency results for the OLS or IV estimators, one must 
directly assume that all the appropriate sequences are asymptotically un- 
correlated so that the almost sure convergence assumed in Theorem 2.18 or 
Exercise 2.20 holds. Since asymptotically uncorrelated sequences will not 
play an important role in the rest of this book, we omit stating and proving 
results for such sequences. 


3.5 Martingale Difference Sequences


In all of the consistency results obtained so far, there has been the explicit
requirement that either $E(X_t \varepsilon_t) = 0$ or $E(Z_t \varepsilon_t) = 0$. Economic theory can
play a key role in justifying this assumption. In fact, it often occurs that
economic theory is used to justify the stronger assumption that $E(\varepsilon_t|X_t) = 0$
or $E(\varepsilon_t|Z_t) = 0$, which then implies $E(X_t \varepsilon_t) = 0$ or $E(Z_t \varepsilon_t) = 0$. In
particular, this occurs when $X_t'\beta_o$ is viewed as the value of $Y_t$ we expect to
observe when $X_t$ occurs, so that $X_t'\beta_o$ is the conditional expectation of $Y_t$
given $X_t$, i.e., $E(Y_t|X_t) = X_t'\beta_o$. Then we define $\varepsilon_t \equiv Y_t - E(Y_t|X_t) =
Y_t - X_t'\beta_o$. Using the algebra of conditional expectations given below, it
is straightforward to show that $E(\varepsilon_t|X_t) = 0$.

One of the more powerful economic theories (powerful in the sense of
imposing a great deal of structure on the way in which the data behave) is
the theory of rational expectations. Often this theory can be used not only
to justify the assumption that $E(\varepsilon_t|X_t) = 0$ but also that

    $E(\varepsilon_t | X_t, X_{t-1}, \ldots; \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) = 0$,

i.e., that the conditional expectation of $\varepsilon_t$, given the entire past history of
the errors $\varepsilon_t$ and the current and past values of the explanatory variables
$X_t$, is zero. This assumption allows us to apply laws of large numbers for
martingale difference sequences that are convenient and powerful.

To define what martingale difference sequences are and to state the as- 
sociated results, we need to provide a more complete background on the 
properties of conditional expectations. 

So far we have relied on the reader’s intuitive understanding of what a 
conditional expectation is. A precise definition is the following. 


Definition 3.58 Let $Y$ be an $\mathcal{F}$-measurable random variable, $E(|Y|) < \infty$,
and let $\mathcal{G}$ be a $\sigma$-field contained in $\mathcal{F}$. Then there exists a random variable
$E(Y|\mathcal{G})$, called the conditional expectation of $Y$ given $\mathcal{G}$, such that


(i) $E(Y|\mathcal{G})$ is $\mathcal{G}$-measurable and $E(|E(Y|\mathcal{G})|) < \infty$.
(ii) $E(Y|\mathcal{G})$ satisfies

    $E(1_{[G]} E(Y|\mathcal{G})) = E(1_{[G]} Y)$

for all sets $G$ in $\mathcal{G}$, where $1_{[G]}$ is the indicator function equal to unity
on the set $G$ and zero elsewhere.


As Doob (1953, p. 18) notes, this definition actually defines an entire class 
of random variables each of which satisfies the above definition, because 
any random variable that equals $E(Y|\mathcal{G})$ with probability one satisfies this
definition. However, any member of the class of random variables specified 
by the definition can be used in any expression involving a conditional 
expectation, so we will not distinguish between members of this class. 

To put the conditional expectation in more familiar terms, we relate this 
definition to the expectation of Y, conditional on other random variables 
$Z_t$, $t = a, \ldots, b$, as follows.


Definition 3.59 Let $Y_t$ be a random variable such that $E(|Y_t|) < \infty$ and
let $\mathcal{B}_a^b = \sigma(Z_a, Z_{a+1}, \ldots, Z_b)$ be the $\sigma$-algebra generated by the random
vectors $Z_t$, $t = a, \ldots, b$. Then the conditional expectation of $Y_t$ given $Z_t$,
$t = a, \ldots, b$, is defined as

    $E(Y_t | Z_a, \ldots, Z_b) \equiv E(Y_t | \mathcal{B}_a^b)$.

The conditional expectation $E(Y_t | \mathcal{B}_a^b)$ can be expressed as a measurable
function of $Z_t$, $t = a, \ldots, b$, as the following result shows.


Proposition 3.60 Let $\mathcal{B}_a^b = \sigma(Z_a, Z_{a+1}, \ldots, Z_b)$. Then there exists a
measurable function $g$ such that

    $E(Y_t | Z_a, \ldots, Z_b) = g(Z_a, \ldots, Z_b)$.

Proof. Immediate from Doob (1953, Theorem 1.5, p. 603). ∎


Example 3.61 Let $Y$ and $Z$ be jointly normal with $E(Y) = E(Z) = 0$,
$\mathrm{var}(Y) = \sigma_Y^2$, $\mathrm{var}(Z) = \sigma_Z^2$, $\mathrm{cov}(Y, Z) = \sigma_{YZ}$. Then

    $E(Y|Z) = (\sigma_{YZ}/\sigma_Z^2) Z$.
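
The coefficient in this example can be recovered from simulated data. The sketch below assumes one particular construction of a jointly normal pair, $Y = bZ + e$ with $e$ independent of $Z$, so that $\sigma_{YZ} = b\sigma_Z^2$; the construction, the function name, and the parameter values are ours. The ratio of the sample cross moment to the sample second moment of $Z$ is then compared with $\sigma_{YZ}/\sigma_Z^2$.

```python
import random


def simulate_slope(sigma_z, b, sigma_e, n, seed=0):
    """Return (sample sum(Z*Y)/sum(Z*Z), population sigma_YZ / sigma_Z**2)."""
    rng = random.Random(seed)
    szz = szy = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, sigma_z)
        y = b * z + rng.gauss(0.0, sigma_e)   # (Y, Z) jointly normal, mean zero
        szz += z * z
        szy += z * y
    sigma_yz = b * sigma_z ** 2               # population covariance of Y and Z
    return szy / szz, sigma_yz / sigma_z ** 2


if __name__ == "__main__":
    print(simulate_slope(sigma_z=2.0, b=0.75, sigma_e=1.0, n=100_000))
```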


The role of economic theory can now be interpreted as specifying a par- 
ticular form for the function g in Proposition 3.60, although, as we can 
see from Example 3.61, the g function is in fact a direct consequence of 
the form of the joint distribution of the random variables involved. For 
an economic theory to be valid, the g function specified by that economic 
theory must be identical to that implied by the joint distribution of the 
random variables, otherwise the economic theory provides only an approx- 
imation to the statistical relationship between the random variables under 
consideration. 

We now state some useful properties of conditional expectations. 


Proposition 3.62 (Linearity of conditional expectation) Let $a_1$,
$\ldots$, $a_k$ be finite constants and suppose $Y_1, \ldots, Y_k$ are random variables
such that $E|Y_j| < \infty$, $j = 1, \ldots, k$. Then $E(a_j|\mathcal{G}) = a_j$ and

    $E\left(\sum_{j=1}^{k} a_j Y_j \,\middle|\, \mathcal{G}\right) = \sum_{j=1}^{k} a_j E(Y_j|\mathcal{G})$.

Proof. See Doob (1953, p. 23). ∎


Proposition 3.63 If $Y$ is a random variable and $Z$ is a random variable
measurable with respect to $\mathcal{G}$ such that $E|Y| < \infty$ and $E|ZY| < \infty$, then
with probability one

    $E(ZY|\mathcal{G}) = Z E(Y|\mathcal{G})$

and

    $E([Y - E(Y|\mathcal{G})] Z) = 0$.

Proof. See Doob (1953, p. 22). ∎


Example 3.64 Let $\mathcal{G} = \sigma(X_t)$. Then $E(X_t Y_t|X_t) = X_t E(Y_t|X_t)$. Define
$\varepsilon_t \equiv Y_t - E(Y_t|X_t)$. Then $E(X_t \varepsilon_t) = E(X_t [Y_t - E(Y_t|X_t)]) = 0$.
If we set $E(Y_t|X_t) = X_t'\beta_o$, the result of this example justifies the
orthogonality condition for the OLS estimator, $E(X_t \varepsilon_t) = 0$.

A version of Jensen's inequality also holds for conditional expectations.


Proposition 3.65 (Conditional Jensen's inequality) Let $g : \mathbb{R} \to \mathbb{R}$
be a convex function on an interval $B \subset \mathbb{R}$ and let $Y$ be a random variable
such that $P[Y \in B] = 1$. If $E|Y| < \infty$ and $E|g(Y)| < \infty$, then

    $g[E(Y|\mathcal{G})] \le E(g(Y)|\mathcal{G})$

for any $\sigma$-field $\mathcal{G}$ of sets in $\Omega$. If $g$ is concave, then

    $g[E(Y|\mathcal{G})] \ge E(g(Y)|\mathcal{G})$.

Proof. See Doob (1953, p. 33). ∎

Example 3.66 Let $g(y) = |y|$. It follows from the conditional Jensen's
inequality that $|E(Y|\mathcal{G})| \le E(|Y| \,|\, \mathcal{G})$.

Proposition 3.67 Let $\mathcal{G}$ and $\mathcal{H}$ be $\sigma$-fields and suppose $\mathcal{G} \subset \mathcal{H}$ and that
some version of $E(Y|\mathcal{H})$ is measurable with respect to $\mathcal{G}$. Then

    $E(Y|\mathcal{H}) = E(Y|\mathcal{G})$,

with probability one.


Proof. See Doob (1953, p. 21). ∎

In other words, conditional expectations with respect to two different
$\sigma$-fields, one contained in the other, coincide provided that the expectation
conditioned on the larger $\sigma$-field is measurable with respect to the smaller
$\sigma$-field. Otherwise, no necessary relation holds between the two conditional
expectations.


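Example 3.68 Suppose that $E(Y_t|\mathcal{H}_{t-1}) = 0$, where $\mathcal{H}_{t-1} = \sigma(\ldots, Y_{t-2},
Y_{t-1})$. Then $E(Y_t|Y_{t-1}) = 0$, since $E(Y_t|Y_{t-1}) = E(Y_t|\mathcal{G}_{t-1})$, where $\mathcal{G}_{t-1} =
\sigma(Y_{t-1})$ satisfies $\mathcal{G}_{t-1} \subset \mathcal{H}_{t-1}$ and $E(Y_t|\mathcal{H}_{t-1}) = 0$ is measurable with
respect to $\mathcal{G}_{t-1}$.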


One of the most useful properties of the conditional expectation is given
by the law of iterated expectations.

Proposition 3.69 (Law of iterated expectations) Let $E|Y| < \infty$ and
let $\mathcal{G}$ be a $\sigma$-field of sets in $\Omega$. Then

    $E[E(Y|\mathcal{G})] = E(Y)$.

Proof. Set $G = \Omega$ in Definition 3.58. ∎

Example 3.70 Suppose $E(\varepsilon_t|X_t) = 0$. Then by Proposition 3.69, $E(\varepsilon_t) =
E(E(\varepsilon_t|X_t)) = 0$.

A more general result is the following.


Proposition 3.71 (Law of iterated expectations) Let $\mathcal{G}$ and $\mathcal{H}$ be
$\sigma$-fields of sets in $\Omega$ with $\mathcal{H} \subset \mathcal{G}$, and suppose $E(|Y|) < \infty$. Then

    $E[E(Y|\mathcal{G})|\mathcal{H}] = E(Y|\mathcal{H})$.


Proof. See Doob (1953, p. 37). ∎

Proposition 3.69 is the special case of Proposition 3.71 in which $\mathcal{H} =
\{\emptyset, \Omega\}$, the trivial $\sigma$-field.
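
Both forms of the law of iterated expectations can be checked exactly on a finite sample space, where a $\sigma$-field corresponds to a partition and the conditional expectation is the probability-weighted average over each block. The four-point space, the probabilities, and the function names below are our own illustrative assumptions.

```python
from fractions import Fraction


def conditional_expectation(values, probs, partition):
    """Return E(Y | partition) as a list of its values at each sample point."""
    result = [None] * len(values)
    for block in partition:
        mass = sum(probs[w] for w in block)
        average = sum(probs[w] * values[w] for w in block) / mass
        for w in block:
            result[w] = average
    return result


def expectation(values, probs):
    """Return the unconditional expectation of a random variable."""
    return sum(p * v for p, v in zip(probs, values))


probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
y = [Fraction(5), Fraction(-1), Fraction(2), Fraction(0)]
fine = [[0], [1], [2, 3]]    # partition generating the larger sigma-field G
coarse = [[0, 1], [2, 3]]    # partition generating the smaller sigma-field H

y_given_g = conditional_expectation(y, probs, fine)
lhs = conditional_expectation(y_given_g, probs, coarse)   # E[E(Y|G)|H]
rhs = conditional_expectation(y, probs, coarse)           # E(Y|H)

if __name__ == "__main__":
    print(lhs == rhs)
    print(expectation(y_given_g, probs) == expectation(y, probs))
```

Exact rational arithmetic is used so that the equalities hold without rounding error.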

With the law of iterated expectations available, it is straightforward to
show that the conditional expectation has an optimal prediction property,
in the sense that in predicting a random variable $Y$, the prediction mean
squared error of the conditional expectation of $Y$ is no larger than that of
any other predictor of $Y$ measurable with respect to the same $\sigma$-field.


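Theorem 3.72 Let $Y$ be a random variable with $E(Y^2) < \infty$ and let
$\hat{Y} \equiv E(Y|\mathcal{G})$. Then for any other $\mathcal{G}$-measurable random variable $\tilde{Y}$,
$E((Y - \hat{Y})^2) \le E((Y - \tilde{Y})^2)$.

Proof. Adding and subtracting $\hat{Y}$ in $(Y - \tilde{Y})^2$ gives

    $E((Y - \tilde{Y})^2) = E((Y - \hat{Y} + \hat{Y} - \tilde{Y})^2)$
    $\qquad = E((Y - \hat{Y})^2) + 2E((Y - \hat{Y})(\hat{Y} - \tilde{Y})) + E((\hat{Y} - \tilde{Y})^2)$.

By the law of iterated expectations and Proposition 3.63,

    $E((Y - \hat{Y})(\hat{Y} - \tilde{Y})) = E[E((Y - \hat{Y})(\hat{Y} - \tilde{Y})|\mathcal{G})]$
    $\qquad = E[E(Y - \hat{Y}|\mathcal{G})(\hat{Y} - \tilde{Y})]$.

But $E(Y - \hat{Y}|\mathcal{G}) = 0$, so $E((Y - \hat{Y})(\hat{Y} - \tilde{Y})) = 0$ and

    $E((Y - \tilde{Y})^2) = E((Y - \hat{Y})^2) + E((\hat{Y} - \tilde{Y})^2)$,

and the result follows, because the final term on the right is nonnegative.
∎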

This result provides us with another interpretation for the conditional 
expectation. The conditional expectation of Y given G gives the minimum 
mean squared error prediction of Y based on a specified information set 
($\sigma$-field) $\mathcal{G}$.
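
The optimal prediction property can be illustrated by simulation. The sketch below assumes $Y = Z^2 + e$ with $Z$ and $e$ independent standard normals, so that $E(Y|Z) = Z^2$, and compares it with the competing predictor $1 + Z$, which is also a function of $Z$. The design and the function name are hypothetical. Under these assumptions the population mean squared errors are 1 and 4, and the gap of 3 equals the expected squared distance between the two predictors, matching the decomposition in the proof of Theorem 3.72.

```python
import random


def prediction_mse(n, seed=0):
    """Return simulated MSEs of the predictors Z**2 and 1 + Z for Y."""
    rng = random.Random(seed)
    mse_cond = mse_other = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        y = z * z + rng.gauss(0.0, 1.0)     # so that E(Y | Z) = Z**2
        mse_cond += (y - z * z) ** 2        # error of the conditional expectation
        mse_other += (y - (1.0 + z)) ** 2   # error of another function of Z
    return mse_cond / n, mse_other / n


if __name__ == "__main__":
    print(prediction_mse(100_000))
```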

With the next definition (from Stout, 1974, p. 30) we will have sufficient 
background to define the concept of a martingale difference sequence. 


Definition 3.73 Let $\{Y_t\}$ be a sequence of random scalars, and let $\{\mathcal{F}_t\}$ be
a sequence of $\sigma$-fields $\mathcal{F}_t \subset \mathcal{F}$ such that $\mathcal{F}_{t-1} \subset \mathcal{F}_t$ for all $t$ (i.e., $\{\mathcal{F}_t\}$ is an
increasing sequence of $\sigma$-fields, also called a filtration). If $Y_t$ is measurable
with respect to $\mathcal{F}_t$, then $\{\mathcal{F}_t\}$ is said to be adapted to the sequence $\{Y_t\}$
and $\{Y_t, \mathcal{F}_t\}$ is called an adapted stochastic sequence.


One way of generating an adapted stochastic sequence is to let $\mathcal{F}_t$ be
the $\sigma$-field generated by current and past $Y_t$, i.e., $\mathcal{F}_t = \sigma(\ldots, Y_{t-1}, Y_t)$.
Then $\{\mathcal{F}_t\}$ is increasing and $Y_t$ is always measurable with respect to $\mathcal{F}_t$.
However, $\mathcal{F}_t$ can contain more than just the present and past of $Y_t$; it can
also contain the present and past of other random variables as well. For
example, let $Y_t = Z_{t1}$, where $Z_t' = (Z_{t1}, \ldots, Z_{tq})$, and let $\mathcal{F}_t = \sigma(\ldots,
Z_{t-1}, Z_t)$. Then $\{\mathcal{F}_t\}$ is again increasing and $Y_t$ is again measurable with
respect to $\mathcal{F}_t$, so $\{Y_t, \mathcal{F}_t\}$ is an adapted stochastic sequence. This is the
situation most relevant for our purposes.


Definition 3.74 Let $\{Y_t, \mathcal{F}_t\}$ be an adapted stochastic sequence. Then
$\{Y_t, \mathcal{F}_t\}$ is a martingale difference sequence if

    $E(Y_t|\mathcal{F}_{t-1}) = 0$, for all $t \ge 1$.


Example 3.75 (i) Let $\{Y_t\}$ be a sequence of i.i.d. random variables with
$E(Y_t) = 0$, and let $\mathcal{F}_t = \sigma(\ldots, Y_{t-1}, Y_t)$. Then $\{Y_t, \mathcal{F}_t\}$ is a martingale
difference sequence. (ii) (The Lévy device) Let $\{Y_t, \mathcal{F}_t\}$ be any adapted
stochastic sequence such that $E|Y_t| < \infty$ for all $t$. Then

    $\{Y_t - E(Y_t|\mathcal{F}_{t-1}), \mathcal{F}_t\}$

is a martingale difference sequence because $Y_t - E(Y_t|\mathcal{F}_{t-1})$ is measurable
with respect to $\mathcal{F}_t$ and, by linearity,

    $E[Y_t - E(Y_t|\mathcal{F}_{t-1})|\mathcal{F}_{t-1}] = E(Y_t|\mathcal{F}_{t-1}) - E(Y_t|\mathcal{F}_{t-1}) = 0$.


The device of Example 3.75 (ii) is useful in certain circumstances because
it reduces the study of the behavior of an arbitrary sequence of random
variables to the study of the behavior of a martingale difference sequence
and a sequence of conditional expectations (Stout, 1974, p. 33).

The martingale difference assumption is often justified in economics by
the efficient markets theory or rational expectations theory, e.g., Samuelson
(1965). In these theories the random variable $Y_t$ is the price change of
an asset or a commodity traded in a competitive market and $\mathcal{F}_t$ is the
$\sigma$-field generated by all current and past information available to market
participants, $\mathcal{F}_t = \sigma(\ldots, Z_{t-1}, Z_t)$, where $Z_t$ is a finite-dimensional vector
of observable information, including information on $Y_t$. A zero profit (no
arbitrage) condition then ensures that $E(Y_t|\mathcal{F}_{t-1}) = 0$. Note that if $\mathcal{G}_t =
\sigma(\ldots, Y_{t-1}, Y_t)$, then $\{Y_t, \mathcal{G}_t\}$ is also an adapted stochastic sequence, and
because $\mathcal{G}_t \subset \mathcal{F}_t$, it follows from Proposition 3.71 that

    $E(Y_t|\mathcal{G}_{t-1}) = E[E(Y_t|\mathcal{F}_{t-1})|\mathcal{G}_{t-1}] = 0$,

so $\{Y_t, \mathcal{G}_t\}$ is also a martingale difference sequence.

The martingale difference assumption often arises in a regression context
in the following way. Suppose we have observations on a scalar $Y_t$ (set $p = 1$
for now) that we are interested in explaining or forecasting on the basis of
variables $Z_t$ as well as on the basis of the past values of $Y_t$. Let $\mathcal{F}_{t-1}$ be
the $\sigma$-field containing the information used to explain or forecast $Y_t$, i.e.,
$\mathcal{F}_{t-1} = \sigma(\ldots, (Z_{t-1}', Y_{t-2})', (Z_t', Y_{t-1})')$. Then by Proposition 3.60,

    $E(Y_t|\mathcal{F}_{t-1}) = g(\ldots, (Z_{t-1}', Y_{t-2})', (Z_t', Y_{t-1})')$,

where $g$ is some function of current and past values of $Z_t$ and past values of
$Y_t$. Let $X_t$ contain a finite number of current and lagged values of $(Z_t', Y_{t-1})$,
e.g., $X_t = ((Z_{t-\tau}', Y_{t-\tau-1})', \ldots, (Z_t', Y_{t-1})')'$ for some $\tau < \infty$. Economic
theory is then often used in an attempt to justify the assumption that for
some $\beta_o < \infty$,

    $g(\ldots, (Z_{t-1}', Y_{t-2})', (Z_t', Y_{t-1})') = X_t'\beta_o$.

If this is true, we then have

    $E(Y_t|\mathcal{F}_{t-1}) = X_t'\beta_o$.


Note that by definition, $Y_t$ is measurable with respect to $\mathcal{F}_t$, so that $\{Y_t, \mathcal{F}_t\}$
is an adapted stochastic sequence. Hence, by the Lévy device, we find that
$\{Y_t - E(Y_t|\mathcal{F}_{t-1}), \mathcal{F}_t\}$ is a martingale difference sequence. If we let

    $\varepsilon_t \equiv Y_t - X_t'\beta_o$

and it is true that $E(Y_t|\mathcal{F}_{t-1}) = X_t'\beta_o$, then $\varepsilon_t = Y_t - E(Y_t|\mathcal{F}_{t-1})$, so
$\{\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence. Of direct importance for least
squares estimation is the fact that $\{\mathcal{F}_t\}$ is also adapted to each sequence
of cross products between regressors and errors $\{X_{ti}\varepsilon_t\}$, $i = 1, \ldots, k$. It is
then easily shown that $\{X_{ti}\varepsilon_t, \mathcal{F}_t\}$ is also a martingale difference sequence,
since by Proposition 3.63

    $E(X_{ti}\varepsilon_t|\mathcal{F}_{t-1}) = X_{ti} E(\varepsilon_t|\mathcal{F}_{t-1}) = 0$.


A law of large numbers for martingale difference sequences is the follow- 
ing theorem. 


Theorem 3.76 (Chow) Let $\{Z_t, \mathcal{F}_t\}$ be a martingale difference sequence.
If for some $r \ge 1$, $\sum_{t=1}^{\infty} (E|Z_t|^{2r})/t^{1+r} < \infty$, then $\bar{Z}_n \xrightarrow{a.s.} 0$.


Proof. See Stout (1974, pp. 154-155). ∎


Note the similarity of the present result to the Markov law of large num- 
bers, Theorem 3.7. There the stronger assumption of independence replaces 
the martingale difference assumption, whereas the required moment condi- 
tions are weaker with independence than they are here. A corollary analo- 
gous to Corollary 3.9 also holds. 


Exercise 3.77 Prove the following. Let $\{Z_t, \mathcal{F}_t\}$ be a martingale difference
sequence such that $E|Z_t|^{2r} < \Delta < \infty$ for some $r \ge 1$ and all $t$. Then
$\bar{Z}_n \xrightarrow{a.s.} 0$.
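
A simple martingale difference sequence that is not independent can be used to illustrate this law of large numbers. The sketch below assumes $Y_t = \varepsilon_t \varepsilon_{t-1}$ with $\varepsilon_t$ i.i.d. $N(0, 1)$, a hypothetical example of our own. Conditional on the past, $E(Y_t|\mathcal{F}_{t-1}) = \varepsilon_{t-1} E(\varepsilon_t) = 0$, and $E|Y_t|^2 = 1$, so the moment bound holds with $r = 1$, while neighbouring terms share a common factor and are therefore dependent.

```python
import random


def mds_sample_mean(n, seed=0):
    """Sample mean of Y_t = eps_t * eps_{t-1}, with eps_t i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    eps_prev = rng.gauss(0.0, 1.0)
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        total += eps * eps_prev   # E(Y_t | past) = eps_{t-1} * E(eps_t) = 0
        eps_prev = eps
    return total / n


if __name__ == "__main__":
    for n in (100, 10_000, 200_000):
        print(n, mds_sample_mean(n))
```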


Using this result and the law of large numbers for mixing sequences, we 
can state the following consistency result for the OLS estimator. 


Theorem 3.78 Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{X_t\}$ is a sequence of mixing random variables with $\phi$ of size
$-r/(2r - 1)$, $r \ge 1$, or $\alpha$ of size $-r/(r - 1)$, $r > 1$;

(iii) (a) $\{X_{thi}\varepsilon_{th}, \mathcal{F}_t\}$ is a martingale difference sequence, $h = 1, \ldots, p$,
$i = 1, \ldots, k$;
(b) $E|X_{thi}\varepsilon_{th}|^{2r} < \Delta < \infty$, for all $h = 1, \ldots, p$, $i = 1, \ldots, k$ and $t$;

(iv) (a) $E|X_{thi}^2|^{r+\delta} < \Delta < \infty$, for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, k$ and $t$;
(b) $M_n \equiv E(X'X/n)$ is uniformly positive definite.


Then $\hat{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\hat{\beta}_n \xrightarrow{a.s.} \beta_o$.


Proof. To verify that the conditions of Theorem 2.18 hold, we note first
that $X'\varepsilon/n = \sum_{h=1}^{p} X_h'\varepsilon_h/n$, where $X_h$ is the $n \times k$ matrix with rows
$X_{th}'$ and $\varepsilon_h$ is the $n \times 1$ vector with elements $\varepsilon_{th}$. By assumption (iii.a),
$\{X_{thi}\varepsilon_{th}, \mathcal{F}_t\}$ is a martingale difference sequence. Because the moment
conditions of Exercise 3.77 are satisfied by (iii.b), we have
$n^{-1} \sum_{t=1}^{n} X_{thi}\varepsilon_{th} \xrightarrow{a.s.} 0$,
$h = 1, \ldots, p$, $i = 1, \ldots, k$, so $X'\varepsilon/n \xrightarrow{a.s.} 0$ by Proposition 2.11. Next,
Proposition 3.50 ensures that $\{X_t X_t'\}$ is a mixing sequence (given (ii))
that satisfies the conditions of Corollary 3.48 (given (iv.a)). It follows that
$X'X/n - M_n \xrightarrow{a.s.} 0$, and $M_n = O(1)$ (given (iv.a)) by Jensen's inequality.
Hence the conditions of Theorem 2.18 are satisfied and the result follows.
∎

Note that the conditions placed on $X_t$ by (ii) and (iv.a) ensure that
$X'X/n - M_n \xrightarrow{a.s.} 0$ and that these conditions can be replaced by any
other conditions that ensure the same conclusion.

A result for the IV estimator can be obtained analogously. 


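Exercise 3.79 Prove the following result. Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with $\phi$ of size $-r/(2r - 1)$, $r \ge 1$,
or $\alpha$ of size $-r/(r - 1)$, $r > 1$;

(iii) (a) $\{Z_{thi}\varepsilon_{th}, \mathcal{F}_t\}$ is a martingale difference sequence, $h = 1, \ldots, p$,
$i = 1, \ldots, l$;
(b) $E|Z_{thi}\varepsilon_{th}|^{2r} < \Delta < \infty$, for all $h = 1, \ldots, p$, $i = 1, \ldots, l$ and $t$;

(iv) (a) $E|Z_{thi}X_{thj}|^{r+\delta} < \Delta < \infty$, for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, $j = 1, \ldots, k$ and $t$;
(b) $Q_n \equiv E(Z'X/n)$ has uniformly full column rank;
(c) $\hat{P}_n - P_n \xrightarrow{a.s.} 0$, where $P_n = O(1)$ and is symmetric and uniformly
positive definite.


Then $\tilde{\beta}_n$ exists for all $n$ sufficiently large a.s., and $\tilde{\beta}_n \xrightarrow{a.s.} \beta_o$.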


As with results for the OLS estimator, (ii) and (iv.a) can be replaced by
any other conditions that ensure $Z'X/n - Q_n \xrightarrow{a.s.} 0$. Note that
assumption (ii) is stronger than absolutely necessary here. Instead, it suffices that
$\{(Z_t', X_t')\}$ is appropriately mixing. However, assumption (ii) is used later
to ensure the consistency of estimated covariance matrices.


CHAPTER 4


Asymptotic Normality


In the classical linear model with fixed regressors and normally distributed
i.i.d. errors, the least squares estimator $\hat{\beta}_n$ is distributed as a multivariate
normal with $E(\hat{\beta}_n) = \beta_o$ and $\mathrm{var}(\hat{\beta}_n) = \sigma_o^2 (X'X)^{-1}$ for any sample size
$n$. This fact forms the basis for statistical tests of hypotheses, based
typically on $t$- and $F$-statistics. When the sample size is large, econometric
estimators such as $\hat{\beta}_n$ have a distribution that is approximately normal
under much more general conditions, and this fact forms the basis for large
sample statistical tests of hypotheses. In this chapter we study the tools
used in determining the asymptotic distribution of $\hat{\beta}_n$, how this
asymptotic distribution can be used to test hypotheses in large samples, and how
asymptotic efficiency can be obtained.


4.1 Convergence in Distribution 
The most fundamental concept is that of convergence in distribution. 


Definition 4.1 Let $\{b_n\}$ be a sequence of random finite-dimensional
vectors with joint distribution functions $\{F_n\}$. If $F_n(z) \to F(z)$ as $n \to \infty$ for
every continuity point $z$, where $F$ is the distribution function of a random
variable $Z$, then $b_n$ converges in distribution to the random variable $Z$,
denoted $b_n \xrightarrow{d} Z$.

Heuristically, the distribution of $b_n$ gets closer and closer to that of the
random variable $Z$, so the distribution $F$ can be used as an approximation
to the distribution of $b_n$. When $b_n \xrightarrow{d} Z$, we also say that $b_n$ converges in
law to $Z$ (written $b_n \xrightarrow{L} Z$), or that $b_n$ is asymptotically distributed as
$F$, denoted $b_n \overset{A}{\sim} F$. Then $F$ is called the limiting distribution of $b_n$. Note
that the convergence specified by this definition is pointwise and only has
to occur at points $z$ where $F$ is continuous (the "continuity points").


Example 4.2 Let {bn} be a sequence of i.i.d. random variables with dis- 
tribution function F. Then (trivially) F is the limiting distribution of bn. 


This illustrates the fact that convergence in distribution is a very weak 
convergence concept and by itself implies nothing about the convergence 
of the sequence of random variables. 


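Example 4.3 Let $\{Z_t\}$ be i.i.d. random variables with mean $\mu$ and finite
variance $\sigma^2 > 0$. Define

    $b_n \equiv \dfrac{\bar{Z}_n - E(\bar{Z}_n)}{(\mathrm{var}(\bar{Z}_n))^{1/2}} = n^{-1/2} \sum_{t=1}^{n} (Z_t - \mu)/\sigma$.

Then by the Lindeberg-Lévy central limit theorem (Theorem 5.2),
$b_n \overset{A}{\sim} N(0, 1)$.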


In other words, the sample mean Z,, of i.i.d. observations, when standard- 
ized, has a distribution that approaches the standard normal distribution. 
This result actually holds under rather general conditions on the sequence 
{ Zt}. The conditions under which this convergence occurs are studied at 
length in the next chapter. In this chapter, we simply assume that such 
conditions are satisfied, so convergence in distribution is guaranteed. 
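
The convergence in Example 4.3 can be seen numerically. The following is a minimal simulation sketch, not part of the original text: it assumes Exponential(1) draws (so $\mu = \sigma = 1$), and the helper names `standardized_mean`, `normal_cdf`, and `max_cdf_gap` are hypothetical. It compares the empirical distribution function of $b_n$ with the standard normal distribution function on a grid of points.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def standardized_mean(draws, mu, sigma):
    """Return b_n = sqrt(n) * (Zbar_n - mu) / sigma for each row of `draws`."""
    n = draws.shape[1]
    return math.sqrt(n) * (draws.mean(axis=1) - mu) / sigma

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_cdf_gap(b, grid):
    """Largest gap between the empirical CDF of b and the N(0,1) CDF on grid."""
    return max(abs(np.mean(b <= z) - normal_cdf(z)) for z in grid)

grid = np.linspace(-3.0, 3.0, 25)
gaps = {}
for n in (2, 20, 200):
    # Assumed design: Exponential(1) draws, which are clearly non-normal.
    draws = rng.exponential(scale=1.0, size=(20000, n))
    gaps[n] = max_cdf_gap(standardized_mean(draws, 1.0, 1.0), grid)
    print(n, round(gaps[n], 3))
```

The gap shrinks as $n$ grows, which is what pointwise convergence of $F_n$ to $F$ at continuity points suggests; the simulation only illustrates the result and does not establish it.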

Convergence in distribution is meaningful even when the limiting distri- 
bution is that of a degenerate random variable. 


Lemma 4.4 Suppose $b_n \xrightarrow{p} b$ (a constant). Then $b_n \overset{A}{\sim} F_b$, where $F_b$ is
the distribution function of a random variable $Z$ that takes the value $b$ with
probability one (i.e., $b_n \xrightarrow{d} b$). Also, if $b_n \overset{A}{\sim} F_b$, then $b_n \xrightarrow{p} b$.


Proof. See Rao (1973, p. 120). ∎


In other words, convergence in probability to a constant implies convergence 
in distribution to that constant. The converse is also true. 
A useful implication of convergence in distribution is the following lemma.


Lemma 4.5 If $b_n \xrightarrow{d} Z$, then $b_n$ is $O_p(1)$.


Proof. Recall that $b_n$ is $O_p(1)$ if, given any $\delta > 0$, $P[|b_n| > \Delta_\delta] < \delta$
for some $\Delta_\delta < \infty$ and all $n > N_\delta$. Because $b_n \xrightarrow{d} Z$,
$P[|b_n| > \Delta_\delta] \to P[|Z| > \Delta_\delta]$, provided (without loss of generality) that
$\Delta_\delta$ and $-\Delta_\delta$ are continuity points of the distribution of $Z$. Thus
$|P[|b_n| > \Delta_\delta] - P[|Z| > \Delta_\delta]| < \delta$ for all $n > N_\delta$, so that
$P[|b_n| > \Delta_\delta] < \delta + P[|Z| > \Delta_\delta]$ for all $n > N_\delta$. Because
$P[|Z| > \Delta_\delta] < \delta$ for $\Delta_\delta$ sufficiently large, we have
$P[|b_n| > \Delta_\delta] < 2\delta$ for $\Delta_\delta$ sufficiently large and all $n > N_\delta$. ∎


This allows us to establish the next useful lemma. 


Lemma 4.6 (Product rule) Recall from Corollary 2.36 that if $A_n = o_p(1)$
and $b_n = O_p(1)$, then $A_n b_n = o_p(1)$. Hence, if $A_n \xrightarrow{p} 0$ and
$b_n \xrightarrow{d} Z$, then $A_n b_n \xrightarrow{p} 0$.


In turn, this result is often used in conjunction with the following result, 
which is one of the most useful of those relating convergence in probability 
and convergence in distribution. 


Lemma 4.7 (Asymptotic equivalence) Let $\{a_n\}$ and $\{b_n\}$ be two
sequences of random vectors. If $a_n - b_n \xrightarrow{p} 0$ and $b_n \xrightarrow{d} Z$,
then $a_n \xrightarrow{d} Z$.


Proof. See Rao (1973, p. 123). ∎


This result is helpful in situations in which we wish to find the asymptotic
distribution of $a_n$ but cannot easily do so directly. Often, however, it is easy
to find a $b_n$ that has a known asymptotic distribution and that satisfies
$a_n - b_n \xrightarrow{p} 0$. Lemma 4.7 then ensures that $a_n$ has the same limiting
distribution as $b_n$, and we say that $a_n$ is "asymptotically equivalent" to
$b_n$. The joint use of Lemmas 4.6 and 4.7 is the key to the proof of the
asymptotic normality results for the OLS and IV estimators.

Another useful tool in the study of convergence in distribution is the 
characteristic function. 


Definition 4.8 Let $Z$ be a $k \times 1$ random vector with distribution function
$F$. The characteristic function of $Z$ is defined as $f(\lambda) \equiv E(\exp(i\lambda'Z))$,
where $i^2 = -1$ and $\lambda$ is a $k \times 1$ real vector.


Example 4.9 Let $Z$ be a nonstochastic real number, $Z = c$. Then
$f(\lambda) = E(\exp(i\lambda Z)) = E(\exp(i\lambda c)) = \exp(i\lambda c)$.


Example 4.10 (i) Let $Z \sim N(\mu, \sigma^2)$. Then $f(\lambda) = \exp(i\lambda\mu - \lambda^2\sigma^2/2)$.
(ii) Let $Z \sim N(\mu, \Sigma)$, where $\mu$ is $k \times 1$ and $\Sigma$ is $k \times k$. Then
$f(\lambda) = \exp(i\lambda'\mu - \lambda'\Sigma\lambda/2)$.

A useful table of characteristic functions is given by Lukacs (1970, p. 18).

Because the characteristic function is the Fourier transformation of the 
probability density function, it has the property that any characteristic 
function uniquely determines a distribution function, as formally expressed 
by the next result. 


Theorem 4.11 (Uniqueness theorem) Two distribution functions are 
identical if and only if their characteristic functions are identical. 
Proof. See Lukacs (1974, p. 14). ∎


Thus the behavior of a random variable can be studied either through 
its distribution function or its characteristic function, whichever is more 
convenient. 


Example 4.12 The distribution of a linear transformation of a random
variable is easily found using the characteristic function. Consider
$Y = AZ$, where $A$ is a $q \times k$ matrix and $Z$ is a random $k$-vector. Let
$\theta$ be $q \times 1$. Then

    $$f_Y(\theta) = E(\exp(i\theta'Y)) = E(\exp(i\theta'AZ)) = E(\exp(i\lambda'Z)) = f_Z(\lambda),$$

defining $\lambda = A'\theta$. Hence if $Z \sim N(\mu, \Sigma)$,

    $$f_Y(\theta) = f_Z(\lambda) = \exp(i\lambda'\mu - \lambda'\Sigma\lambda/2) = \exp(i\theta'A\mu - \theta'A\Sigma A'\theta/2),$$

so that $Y \sim N(A\mu, A\Sigma A')$ by the uniqueness theorem.
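
As a numerical sanity check on Example 4.12, the sketch below compares the sample average of $\exp(i\theta'Y)$ with the normal characteristic function evaluated at mean $A\mu$ and covariance $A\Sigma A'$. The particular values of `mu`, `Sigma`, `A`, and `theta`, and the helper names `normal_cf` and `empirical_cf`, are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed illustrative parameters (Sigma is positive definite).
mu = np.array([1.0, -0.5, 0.25])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 1.0]])   # q x k with q = 2, k = 3

def normal_cf(lam, mean, cov):
    """Characteristic function of N(mean, cov) evaluated at lam."""
    return np.exp(1j * (lam @ mean) - 0.5 * (lam @ cov @ lam))

def empirical_cf(theta, sample):
    """Sample average of exp(i * theta'y) over the rows y of `sample`."""
    return np.mean(np.exp(1j * (sample @ theta)))

Z = rng.multivariate_normal(mu, Sigma, size=200000)
Y = Z @ A.T                      # each row is A z
theta = np.array([0.4, -0.3])

cf_formula = normal_cf(theta, A @ mu, A @ Sigma @ A.T)   # f_Y(theta)
cf_via_Z = normal_cf(A.T @ theta, mu, Sigma)             # f_Z(A' theta)
cf_sample = empirical_cf(theta, Y)
print(cf_formula, cf_sample)
```

The first two quantities agree exactly (up to rounding) because they are the same algebraic expression, while the sample version agrees only up to simulation error.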
Other useful facts regarding characteristic functions are the following. 
Proposition 4.13 Let $Y = aZ + b$, $a, b \in \mathbb{R}$. Then
$f_Y(\lambda) = f_Z(a\lambda) \exp(i\lambda b)$.

Proof.

    $$f_Y(\lambda) = E(\exp(i\lambda Y)) = E(\exp(i\lambda(aZ + b)))$$
    $$= E(\exp(i\lambda aZ) \exp(i\lambda b))$$
    $$= E(\exp(i\lambda aZ)) \exp(i\lambda b) = f_Z(\lambda a) \exp(i\lambda b).$$

∎


Proposition 4.14 Let $Y$ and $Z$ be independent. Then if $X = Y + Z$,
$f_X(\lambda) = f_Y(\lambda) f_Z(\lambda)$.


Proof.

    $$f_X(\lambda) = E(\exp(i\lambda X)) = E(\exp(i\lambda(Y + Z)))$$
    $$= E(\exp(i\lambda Y) \exp(i\lambda Z))$$
    $$= E(\exp(i\lambda Y)) E(\exp(i\lambda Z))$$

by independence. Hence $f_X(\lambda) = f_Y(\lambda) f_Z(\lambda)$. ∎


Proposition 4.15 If the $k$th moment $\mu_k$ of a distribution function $F$
exists, then the characteristic function $f$ of $F$ can be differentiated $k$ times
and $f^{(k)}(0) = i^k \mu_k$, where $f^{(k)}$ is the $k$th derivative of $f$.


Proof. This is an immediate corollary of Lukacs (1970, Corollary 3 to
Theorem 2.3.1, p. 22). ∎


Example 4.16 Suppose that $Z \sim N(0, \sigma^2)$. Then $f'(0) = 0$, $f''(0) = -\sigma^2$,
$f'''(0) = 0$, etc.


The main result of use in studying convergence in distribution is the 
following. 


Theorem 4.17 (Continuity theorem) Let $\{b_n\}$ be a sequence of random
$k \times 1$ vectors with characteristic functions $\{f_n(\lambda)\}$. If $b_n \xrightarrow{d} Z$, then
for every $\lambda$, $f_n(\lambda) \to f(\lambda)$, where $f(\lambda) = E(\exp(i\lambda'Z))$. Further, if for
every $\lambda$, $f_n(\lambda) \to f(\lambda)$ and $f$ is continuous at $\lambda = 0$, then $b_n \xrightarrow{d} Z$, where
$f(\lambda) = E(\exp(i\lambda'Z))$.


Proof. See Lukacs (1970, pp. 49-50). ∎


This result essentially says that convergence in distribution is equivalent
to convergence of characteristic functions. The usefulness of the result is
that often it is much easier to study the limiting behavior of characteristic
functions than distribution functions. If the sequence of characteristic
functions $f_n$ converges to a function $f$ that is continuous at $\lambda = 0$, this
theorem guarantees that the function $f$ is a characteristic function and that
the limiting distribution $F$ of $b_n$ is that corresponding to the characteristic
function $f(\lambda)$.

In all the cases that follow, the limiting distribution F will be either that
of a degenerate random variable (following from convergence in probability
to a constant) or a multivariate normal distribution (following from an
appropriate central limit theorem). In the latter case, it is often convenient
to standardize the random variables so that the asymptotic distribution is
unit multivariate normal. To do this we can use the matrix square root.


Exercise 4.18 Prove the following. Let $V$ be a positive (semi) definite
symmetric matrix. Then there exists a positive (semi) definite symmetric
matrix square root $V^{1/2}$ such that the elements of $V^{1/2}$ are continuous
functions of $V$ and $V^{1/2}V^{1/2} = V$. (Hint: Express $V$ as $V = Q'DQ$, where
$Q$ is an orthogonal matrix and $D$ is diagonal with the eigenvalues of $V$
along the diagonal.)


Exercise 4.19 Show that if $Z \sim N(0, V)$, then $V^{-1/2}Z \sim N(0, I)$,
provided $V$ is positive definite, where $V^{-1/2} = (V^{1/2})^{-1}$.
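
The construction in the hint to Exercise 4.18 is easy to try numerically. The following is a small sketch under stated assumptions: `sym_sqrt` is a hypothetical helper, the matrix `V` is an arbitrary example, and NumPy's `eigh` returns the decomposition in the form $V = QDQ'$ (eigenvectors in columns), which is the transpose convention of the hint. It is an illustration, not a proof of either exercise.

```python
import numpy as np

def sym_sqrt(V):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    eigvals, Q = np.linalg.eigh(V)          # V = Q diag(eigvals) Q'
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative rounding
    return (Q * np.sqrt(eigvals)) @ Q.T     # Q diag(sqrt(eigvals)) Q'

V = np.array([[4.0, 1.0],
              [1.0, 3.0]])                  # assumed example, positive definite
V_half = sym_sqrt(V)
V_neg_half = np.linalg.inv(V_half)

# Standardize draws from N(0, V): the result should look like N(0, I).
rng = np.random.default_rng(3)
Z = rng.multivariate_normal(np.zeros(2), V, size=100000)
standardized = Z @ V_neg_half.T
sample_cov = np.cov(standardized, rowvar=False)
print(np.round(sample_cov, 2))
```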


Definition 4.20 Let $\{b_n\}$ be a sequence of random vectors. If there exists
a sequence of matrices $\{V_n\}$ such that $V_n$ is nonsingular for all $n$
sufficiently large and $V_n^{-1/2} b_n \overset{A}{\sim} N(0, I)$, then $V_n$ is called the asymptotic
covariance matrix of $b_n$, denoted $\mathrm{avar}(b_n)$.


When $\mathrm{var}(b_n)$ is finite, we can usually define $V_n = \mathrm{var}(b_n)$. Note that
the behavior of $b_n$ is not restricted to require that $V_n$ converge to any
limit, although it may. Generally, however, we will at least require that the
smallest eigenvalues of $V_n$ and $V_n^{-1}$ are uniformly bounded away from zero
for all $n$ sufficiently large. Even when $\mathrm{var}(b_n)$ is not finite, the asymptotic
covariance matrix can exist, although in such cases we cannot set
$V_n = \mathrm{var}(b_n)$.


Example 4.21 Define $b_n = Z + Y/n$, where $Z \sim N(0, 1)$ and $Y$ is Cauchy,
independent of $Z$. Then $\mathrm{var}(b_n)$ is infinite for every $n$, but $b_n \overset{A}{\sim} N(0, 1)$ as
a consequence of Lemma 4.7. Hence $\mathrm{avar}(b_n) = 1$.
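
Example 4.21 can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the original argument: it draws $Z$ and $Y$ once, forms $Z + Y/n$ for several $n$, and measures how far the empirical distribution function is from the standard normal one. The names `normal_cdf`, `max_cdf_gap`, and `gap_by_n` are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
reps = 50000
Z = rng.standard_normal(reps)
Y = rng.standard_cauchy(reps)    # Cauchy: no finite mean or variance

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_cdf_gap(b, grid):
    """Largest gap between the empirical CDF of b and the N(0,1) CDF on grid."""
    return max(abs(np.mean(b <= z) - normal_cdf(z)) for z in grid)

grid = np.linspace(-3.0, 3.0, 25)
# b_n = Z + Y/n: the Cauchy term is o_p(1), so Lemma 4.7 applies.
gap_by_n = {n: max_cdf_gap(Z + Y / n, grid) for n in (1, 10, 1000)}
for n, gap in gap_by_n.items():
    print(n, round(gap, 3))
```

Although every $b_n$ has infinite variance, the distribution function gap becomes small for large $n$, consistent with $\mathrm{avar}(b_n) = 1$.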


Given a sequence $\{V_n^{-1/2} b_n\}$ that converges in distribution, we shall
often be interested in the behavior of linear combinations of $b_n$, say, $\{A_n b_n\}$,
where $A_n$, like $V_n^{-1/2}$, is not required to converge to a particular limit. We
can use characteristic functions to study the behavior of these sequences
by making use of the following corollary to the continuity theorem.


Corollary 4.22 If $\lambda \in \mathbb{R}^k$ and a sequence $\{f_n(\lambda)\}$ of characteristic
functions converges to a characteristic function $f(\lambda)$, then the convergence is
uniform in every compact subset of $\mathbb{R}^k$.


Proof. This is a straightforward extension of Lukacs (1970, p. 50). ∎

This result says that in any compact subset of $\mathbb{R}^k$ the distance between
$f_n(\lambda)$ and $f(\lambda)$ does not depend on $\lambda$, but only on $n$. This fact is crucial
to establishing the next result.


Lemma 4.23 Let $\{b_n\}$ be a sequence of random $k \times 1$ vectors with
characteristic functions $\{f_n(\lambda)\}$, and suppose $f_n(\lambda) \to f(\lambda)$. If $\{A_n\}$ is any
sequence of $q \times k$ nonstochastic matrices such that $A_n = O(1)$, then the
sequence $\{A_n b_n\}$ has characteristic functions $\{f_n^*(\theta)\}$, where $\theta$ is $q \times 1$,
such that for every $\theta$, $f_n^*(\theta) - f(A_n'\theta) \to 0$.


Proof. From Example 4.12, $f_n^*(\theta) = f_n(A_n'\theta)$. For fixed $\theta$, $\lambda_n \equiv A_n'\theta$ takes
values in a compact region of $\mathbb{R}^k$, say, $N_\theta$, for all $n$ sufficiently large because
$A_n = O(1)$. Because $f_n(\lambda) \to f(\lambda)$, we have $f_n(\lambda_n) - f(\lambda_n) \to 0$ uniformly
for all $\lambda_n$ in $N_\theta$, by Corollary 4.22. Hence for fixed $\theta$,
$f_n(A_n'\theta) - f(A_n'\theta) = f_n^*(\theta) - f(A_n'\theta) \to 0$ for any $O(1)$ sequence $\{A_n\}$.
Because $\theta$ is arbitrary, the result follows. ∎

The following consequence of this result is used many times below. 


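Corollary 4.24 Let $\{b_n\}$ be a sequence of random $k \times 1$ vectors such that
$V_n^{-1/2} b_n \overset{A}{\sim} N(0, I)$, where $\{V_n\}$ and $\{V_n^{-1}\}$ are $O(1)$. Let $\{A_n\}$ be a $O(1)$
sequence of (nonstochastic) $q \times k$ matrices with full row rank $q$ for all $n$
sufficiently large, uniformly in $n$. Then the sequence $\{A_n b_n\}$ is such that
$\Gamma_n^{-1/2} A_n b_n \overset{A}{\sim} N(0, I)$, where $\Gamma_n \equiv A_n V_n A_n'$ and $\Gamma_n$ and $\Gamma_n^{-1}$ are $O(1)$.

Proof. $\Gamma_n = O(1)$ by Lemma 2.19. $\Gamma_n^{-1} = O(1)$ because $\Gamma_n = O(1)$
and $\det(\Gamma_n) > \delta > 0$ for all $n$ sufficiently large, given the conditions on
$\{A_n\}$ and $\{V_n\}$. Let $f_n^*(\theta)$ be the characteristic function of
$\Gamma_n^{-1/2} A_n b_n = \Gamma_n^{-1/2} A_n V_n^{1/2} V_n^{-1/2} b_n$. Because $\Gamma_n^{-1/2} A_n V_n^{1/2} = O(1)$,
Lemma 4.23 applies, implying $f_n^*(\theta) - f(V_n^{1/2} A_n' \Gamma_n^{-1/2} \theta) \to 0$, where
$f(\lambda) = \exp(-\lambda'\lambda/2)$, the limiting characteristic function of $V_n^{-1/2} b_n$. Now
$f(V_n^{1/2} A_n' \Gamma_n^{-1/2} \theta) = \exp(-\theta' \Gamma_n^{-1/2} A_n V_n A_n' \Gamma_n^{-1/2} \theta / 2) = \exp(-\theta'\theta/2)$
by definition of $\Gamma_n^{-1/2}$. Hence $f_n^*(\theta) - \exp(-\theta'\theta/2) \to 0$, so
$\Gamma_n^{-1/2} A_n b_n \overset{A}{\sim} N(0, I)$ by the continuity theorem (Theorem 4.17). ∎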


This result allows us to complete the proof of the following general 
asymptotic normality result for the least squares estimator. 


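Theorem 4.25 Given
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $V_n^{-1/2} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, I)$, where $V_n \equiv \mathrm{var}(n^{-1/2} X'\varepsilon)$ is $O(1)$ and
uniformly positive definite;

(iii) $X'X/n - M_n \xrightarrow{p} 0$, where $M_n \equiv E(X'X/n)$ is $O(1)$ and uniformly
positive definite.


Then

    $$D_n^{-1/2} \sqrt{n}(\hat\beta_n - \beta_o) \overset{A}{\sim} N(0, I),$$

where $D_n \equiv M_n^{-1} V_n M_n^{-1}$ and $D_n^{-1}$ are $O(1)$. Suppose in addition that

(iv) there exists a matrix $\hat V_n$ positive semidefinite and symmetric such that
$\hat V_n - V_n \xrightarrow{p} 0$. Then $\hat D_n - D_n \xrightarrow{p} 0$, where

    $$\hat D_n \equiv (X'X/n)^{-1} \hat V_n (X'X/n)^{-1}.$$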


Proof. Because $X'X/n - M_n \xrightarrow{p} 0$ and $M_n$ is finite and nonsingular by
(iii), $(X'X/n)^{-1}$ and $\hat\beta_n$ exist in probability. Given (i) and the existence
of $(X'X/n)^{-1}$,

    $$\sqrt{n}(\hat\beta_n - \beta_o) = (X'X/n)^{-1} n^{-1/2} X'\varepsilon.$$

Hence, given (ii),

    $$\sqrt{n}(\hat\beta_n - \beta_o) - M_n^{-1} n^{-1/2} X'\varepsilon = [(X'X/n)^{-1} - M_n^{-1}] V_n^{1/2} V_n^{-1/2} n^{-1/2} X'\varepsilon,$$

or, premultiplying by $D_n^{-1/2}$,

    $$D_n^{-1/2} \sqrt{n}(\hat\beta_n - \beta_o) - D_n^{-1/2} M_n^{-1} n^{-1/2} X'\varepsilon = D_n^{-1/2} [(X'X/n)^{-1} - M_n^{-1}] V_n^{1/2} V_n^{-1/2} n^{-1/2} X'\varepsilon.$$

The desired result will follow by applying the product rule Lemma 4.6
to the line immediately above, and the asymptotic equivalence Lemma
4.7 to the preceding line. Now $V_n^{-1/2} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, I)$ by (ii); further,
$D_n^{-1/2} [(X'X/n)^{-1} - M_n^{-1}] V_n^{1/2}$ is $o_p(1)$ because $D_n^{-1/2}$ and $V_n^{1/2}$ are $O(1)$
given (ii) and (iii), and $[(X'X/n)^{-1} - M_n^{-1}]$ is $o_p(1)$ by Proposition 2.30
given (iii). Hence, by Lemma 4.6,

    $$D_n^{-1/2} \sqrt{n}(\hat\beta_n - \beta_o) - D_n^{-1/2} M_n^{-1} n^{-1/2} X'\varepsilon \xrightarrow{p} 0.$$

By Lemma 4.7, the asymptotic distribution of $D_n^{-1/2} \sqrt{n}(\hat\beta_n - \beta_o)$ is the
same as that of $D_n^{-1/2} M_n^{-1} n^{-1/2} X'\varepsilon$. We find the asymptotic distribution
of this random variable by applying Corollary 4.24, which immediately
yields $D_n^{-1/2} M_n^{-1} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, I)$.

Because (ii), (iii), and (iv) hold, $\hat D_n - D_n \xrightarrow{p} 0$ is an immediate
consequence of Proposition 2.30. ∎

The structure of this result is very straightforward. Given the linear
data generating process, we require only that $(X'X/n)$ and $(X'X/n)^{-1}$ are
$O_p(1)$ and that $n^{-1/2} X'\varepsilon$ is asymptotically unit normal after standardizing
by the inverse square root of its asymptotic covariance matrix. The
asymptotic covariance (dispersion) matrix of $\sqrt{n}(\hat\beta_n - \beta_o)$ is $D_n$, which can be
consistently estimated by $\hat D_n$. Note that this result allows the regressors
to be stochastic and imposes no restriction on the serial correlation or
heteroskedasticity of $\varepsilon_t$, except that needed to ensure that (ii) holds. As we
shall see in the next chapter, only mild restrictions need to be imposed in
guaranteeing (ii).
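
To make the sandwich form $\hat D_n = (X'X/n)^{-1} \hat V_n (X'X/n)^{-1}$ concrete, here is a minimal sketch. Two choices of $\hat V_n$ are shown, and both are assumptions of this illustration rather than results of this chapter: $\hat\sigma_n^2 (X'X/n)$, and the heteroskedasticity-robust choice $n^{-1} \sum_t \hat\varepsilon_t^2 X_t X_t'$ (consistency of estimators for $V_n$ is treated in Chapter 6). The function names and the simulated design are hypothetical.

```python
import numpy as np

def ols(X, Y):
    """OLS coefficients."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def v_hat_homoskedastic(X, resid):
    """sigma_hat^2 * (X'X/n): appropriate only if V_n = sigma_o^2 M_n."""
    n = X.shape[0]
    return (resid @ resid / n) * (X.T @ X / n)

def v_hat_white(X, resid):
    """n^{-1} sum_t e_t^2 X_t X_t': an assumed robust choice (no serial correlation)."""
    n = X.shape[0]
    Xe = X * resid[:, None]
    return Xe.T @ Xe / n

def d_hat(X, resid, v_hat_fn):
    """Sandwich estimate (X'X/n)^{-1} V_hat (X'X/n)^{-1}."""
    n = X.shape[0]
    M_inv = np.linalg.inv(X.T @ X / n)
    return M_inv @ v_hat_fn(X, resid) @ M_inv

rng = np.random.default_rng(4)
n = 2000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
beta_o = np.array([1.0, 2.0])
eps = rng.standard_normal(n) * np.sqrt(0.5 + x**2)   # heteroskedastic errors
Y = X @ beta_o + eps

beta_hat = ols(X, Y)
resid = Y - X @ beta_hat
D_homo = d_hat(X, resid, v_hat_homoskedastic)
D_white = d_hat(X, resid, v_hat_white)
se_homo = np.sqrt(np.diag(D_homo) / n)     # standard errors of beta_hat
se_white = np.sqrt(np.diag(D_white) / n)
print(se_homo, se_white)
```

In this design the error variance rises with $x_t^2$, so the two slope standard errors differ noticeably; which $\hat V_n$ is appropriate depends on what can be assumed about $\varepsilon_t$.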

In special cases, it may be known that $V_n$ has a special form. For
example, when $\varepsilon_t$ is an i.i.d. scalar with $E(\varepsilon_t) = 0$, $E(\varepsilon_t^2) = \sigma_o^2$, and $X_t$ is
nonstochastic, then $V_n = \sigma_o^2 X'X/n$. Finding a consistent estimator for $V_n$
then requires no more than finding a consistent estimator for $\sigma_o^2$.

In more general cases considered below it is often possible to write

    $$V_n = E(X'\varepsilon\varepsilon'X/n) = E(X'\Omega_n X/n).$$

Finding a consistent estimator for $V_n$ in these cases is made easier by the
knowledge of the structure of $\Omega_n$. However, even when $\Omega_n$ is unknown, it
turns out that consistent estimators for $V_n$ are generally available. The
conditions under which $V_n$ can be consistently estimated are treated in
Chapter 6.

A result analogous to Theorem 4.25 is available for instrumental variables 
estimators. Because the proof follows that of Theorem 4.25 very closely, 
proof of the following result is left as an exercise. 


Exercise 4.26 Prove the following result. Given
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $V_n^{-1/2} n^{-1/2} Z'\varepsilon \overset{A}{\sim} N(0, I)$, where $V_n \equiv \mathrm{var}(n^{-1/2} Z'\varepsilon)$ is $O(1)$ and
uniformly positive definite;

(iii) (a) $Z'X/n - Q_n \xrightarrow{p} 0$, where $Q_n \equiv E(Z'X/n)$ is $O(1)$ with uniformly
full column rank;
(b) There exists $\hat P_n$ such that $\hat P_n - P_n \xrightarrow{p} 0$ and $P_n = O(1)$ and is
symmetric and uniformly positive definite.

Then $D_n^{-1/2} \sqrt{n}(\tilde\beta_n - \beta_o) \overset{A}{\sim} N(0, I)$, where

    $$D_n \equiv (Q_n' P_n Q_n)^{-1} Q_n' P_n V_n P_n Q_n (Q_n' P_n Q_n)^{-1}$$

and $D_n^{-1}$ are $O(1)$.

Suppose in addition that
(iv) There exists a matrix $\hat V_n$ positive semidefinite and symmetric such
that $\hat V_n - V_n \xrightarrow{p} 0$.

Then $\hat D_n - D_n \xrightarrow{p} 0$, where

    $$\hat D_n \equiv (X'Z \hat P_n Z'X/n^2)^{-1} (X'Z/n) \hat P_n \hat V_n \hat P_n (Z'X/n) (X'Z \hat P_n Z'X/n^2)^{-1}.$$


4.2 Hypothesis Testing 


A direct and very important use of the asymptotic normality of a given 
estimator is in hypothesis testing. Often, hypotheses of interest can be 
expressed in terms of linear combinations of the parameters as 


    $$R\beta_o = r,$$


where $R$ is a given $q \times k$ matrix and $r$ is a given $q \times 1$ vector that, through
$R\beta_o = r$, specify the hypotheses of interest. For example, if the hypothesis
is that the elements of $\beta_o$ sum to unity, $R = [1, \ldots, 1]$ and $r = 1$.

Several different approaches can be taken in computing a statistic to test
the null hypothesis $R\beta_o = r$ versus the alternative $R\beta_o \ne r$. The methods
that we consider here involve the use of Wald, Lagrange multiplier, and
quasi-likelihood ratio statistics.

Although the approaches to forming the test statistics differ, the way 
that we determine their asymptotic distributions is the same. In each case 
we exploit an underlying asymptotic normality property to obtain a statistic
distributed asymptotically as chi-squared ($\chi^2$). To do this we use the
following results. 


Lemma 4.27 Let $g : \mathbb{R}^k \to \mathbb{R}^l$ be continuous on $\mathbb{R}^k$ and let $b_n \xrightarrow{d} Z$, a
$k \times 1$ random vector. Then $g(b_n) \xrightarrow{d} g(Z)$.


Proof. See Rao (1973, p. 124). ∎

Corollary 4.28 Let $V_n^{-1/2} b_n \overset{A}{\sim} N(0, I_k)$. Then

    $$b_n' V_n^{-1} b_n = b_n' V_n^{-1/2} V_n^{-1/2} b_n \overset{A}{\sim} \chi^2_k,$$

where $\chi^2_k$ is a chi-squared random variable with $k$ degrees of freedom.

Proof. By hypothesis, $V_n^{-1/2} b_n \xrightarrow{d} Z \sim N(0, I_k)$. The function $g(z) = z'z$
is continuous on $\mathbb{R}^k$. Hence,

    $$b_n' V_n^{-1} b_n = g(V_n^{-1/2} b_n) \xrightarrow{d} g(Z) = Z'Z \sim \chi^2_k.$$

∎


Typically, $V_n$ will be unknown, but there will be a consistent estimator
$\hat V_n$ such that $\hat V_n - V_n \xrightarrow{p} 0$. To replace $V_n$ in Corollary 4.28 with $\hat V_n$,
we use the following result.


Lemma 4.29 Let $g : \mathbb{R}^k \to \mathbb{R}^l$ be continuous on $\mathbb{R}^k$. If $a_n - b_n \xrightarrow{p} 0$ and
$b_n \xrightarrow{d} Z$, then $g(a_n) - g(b_n) \xrightarrow{p} 0$ and $g(a_n) \xrightarrow{d} g(Z)$.


Proof. Rao (1973, p. 124) proves that $g(a_n) - g(b_n) \xrightarrow{p} 0$. That
$g(a_n) \xrightarrow{d} g(Z)$ follows from Lemmas 4.7 and 4.27. ∎


Now we can prove the result that is the basis for finding the asymptotic 
distribution of the Wald, Lagrange multiplier, and quasi-likelihood ratio 
tests. 


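Theorem 4.30 Let $V_n^{-1/2} b_n \overset{A}{\sim} N(0, I_k)$, and suppose there exists $\hat V_n$
positive semidefinite and symmetric such that $\hat V_n - V_n \xrightarrow{p} 0$, where $V_n$ is
$O(1)$, and for all $n$ sufficiently large, $\det(V_n) > \delta > 0$. Then
$b_n' \hat V_n^{-1} b_n \overset{A}{\sim} \chi^2_k$.


Proof. We apply Lemma 4.29. Consider $\hat V_n^{-1/2} b_n - V_n^{-1/2} b_n$, where $\hat V_n^{-1/2}$
exists in probability for all $n$ sufficiently large. Now

    $$\hat V_n^{-1/2} b_n - V_n^{-1/2} b_n = (\hat V_n^{-1/2} V_n^{1/2} - I) V_n^{-1/2} b_n.$$

By hypothesis $V_n^{-1/2} b_n \overset{A}{\sim} N(0, I_k)$ and $\hat V_n^{-1/2} V_n^{1/2} - I \xrightarrow{p} 0$ by
Proposition 2.30. It follows from the product rule Lemma 4.6 that
$\hat V_n^{-1/2} b_n - V_n^{-1/2} b_n \xrightarrow{p} 0$. Because $V_n^{-1/2} b_n \xrightarrow{d} Z \sim N(0, I_k)$, it follows
from Lemma 4.29 that $b_n' \hat V_n^{-1} b_n \overset{A}{\sim} \chi^2_k$. ∎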
The Wald statistic allows the simplest analysis, although it may or may
not be the easiest statistic to compute in a given situation. The motivation
for the Wald statistic is that when the null hypothesis is correct, $R\hat\beta_n$
should be close to $R\beta_o = r$, so a value of $R\hat\beta_n - r$ far from zero is evidence
against the null hypothesis. To tell how far from zero $R\hat\beta_n - r$ must be
before we reject the null hypothesis, we need to determine its asymptotic
distribution.

Theorem 4.31 (Wald test) Let the conditions of Theorem 4.25 hold and
let $\mathrm{rank}(R) = q \le k$. Then under $H_o$: $R\beta_o = r$,

(i) $\Gamma_n^{-1/2} \sqrt{n}(R\hat\beta_n - r) \overset{A}{\sim} N(0, I)$, where

    $$\Gamma_n \equiv R D_n R' = R M_n^{-1} V_n M_n^{-1} R'.$$

(ii) The Wald statistic $W_n \equiv n(R\hat\beta_n - r)' \hat\Gamma_n^{-1} (R\hat\beta_n - r) \overset{A}{\sim} \chi^2_q$, where

    $$\hat\Gamma_n \equiv R \hat D_n R' = R (X'X/n)^{-1} \hat V_n (X'X/n)^{-1} R'.$$

Proof. (i) Under $H_o$, $R\hat\beta_n - r = R(\hat\beta_n - \beta_o)$, so

    $$\Gamma_n^{-1/2} \sqrt{n}(R\hat\beta_n - r) = \Gamma_n^{-1/2} R D_n^{1/2} D_n^{-1/2} \sqrt{n}(\hat\beta_n - \beta_o).$$

It follows from Corollary 4.24 that $\Gamma_n^{-1/2} \sqrt{n}(R\hat\beta_n - r) \overset{A}{\sim} N(0, I)$.

(ii) Because $\hat D_n - D_n \xrightarrow{p} 0$ from Theorem 4.25, it follows from
Proposition 2.30 that $\hat\Gamma_n - \Gamma_n \xrightarrow{p} 0$. Given the result in (i), (ii) follows from
Theorem 4.30. ∎


This version of the Wald statistic is useful regardless of the presence of
heteroskedasticity or serial correlation in the error terms because a
consistent estimator ($\hat V_n$) for $V_n$ is used in computing $\hat\Gamma_n$. In the special case
when $V_n$ can be consistently estimated by $\hat\sigma_n^2 (X'X/n)$, the Wald test has
the form

    $$W_n = n(R\hat\beta_n - r)' [R (X'X/n)^{-1} R']^{-1} (R\hat\beta_n - r) / \hat\sigma_n^2,$$

which is simply $q$ times the standard F-statistic for testing the hypothesis
$R\beta_o = r$. The validity of the asymptotic $\chi^2_q$ distribution for this statistic
depends crucially on the consistency of $\hat V_n = \hat\sigma_n^2 (X'X/n)$ for $V_n$; if this
$\hat V_n$ is not consistent for $V_n$, the asymptotic distribution of this form for
$W_n$ is not $\chi^2_q$ in general.
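
The general Wald formula and its special case can be sketched in a few lines. The example below is illustrative only: the function `wald_statistic`, the simulated data, and the restriction ($\beta_2 = \beta_3$) are assumptions. It uses $\hat V_n = \hat\sigma_n^2 (X'X/n)$ with $\hat\sigma_n^2 = \hat\varepsilon'\hat\varepsilon/n$, and checks the statistic against the restricted and unrestricted sums of squared residuals. Note that an F-statistic computed with the degrees-of-freedom divisor $n - k$ differs from $W_n/q$ by the factor $n/(n-k)$, which vanishes as $n$ grows.

```python
import numpy as np

def wald_statistic(X, Y, R, r, V_hat):
    """W_n = n (R b - r)' [R (X'X/n)^{-1} V_hat (X'X/n)^{-1} R']^{-1} (R b - r)."""
    n = X.shape[0]
    M_inv = np.linalg.inv(X.T @ X / n)
    beta_hat = M_inv @ (X.T @ Y / n)
    gamma_hat = R @ M_inv @ V_hat @ M_inv @ R.T
    gap = R @ beta_hat - r
    return n * gap @ np.linalg.solve(gamma_hat, gap)

rng = np.random.default_rng(5)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
Y = X @ np.array([1.0, 0.5, 0.5]) + rng.standard_normal(n)   # H0 holds
R = np.array([[0.0, 1.0, -1.0]])    # H0: beta_2 = beta_3
r = np.array([0.0])
q = R.shape[0]

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / n
W = wald_statistic(X, Y, R, r, sigma2_hat * (X.T @ X / n))

# Restricted fit: impose beta_2 = beta_3 by summing the two regressors.
X_r = np.column_stack([X[:, 0], X[:, 1] + X[:, 2]])
resid_r = Y - X_r @ np.linalg.solve(X_r.T @ X_r, X_r.T @ Y)
ssr_u, ssr_r = resid @ resid, resid_r @ resid_r
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print(W, q * F)
```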

The Wald statistic is most convenient in situations in which the
restrictions $R\beta_o = r$ are not easy to impose in estimating $\beta_o$. When these
restrictions are easily imposed (say, $R\beta_o = r$ specifies that the last element
of $\beta_o$ is zero), the Lagrange multiplier statistic is more easily computed.

The motivation for the Lagrange multiplier statistic is that a constrained
least squares estimator can be obtained by solving the problem

    $$\min_\beta\ (Y - X\beta)'(Y - X\beta)/n, \quad \text{s.t. } R\beta = r,$$

which is equivalent to finding the saddle point of the Lagrangian

    $$\mathcal{L} = (Y - X\beta)'(Y - X\beta)/n + (R\beta - r)'\lambda.$$

The Lagrange multipliers $\lambda$ can be thought of as giving the shadow
price of the constraint and should therefore be small when the constraint is
valid and large otherwise. (See Engle, 1981, for a general discussion.) The
Lagrange multiplier test can be thought of as testing the hypothesis that
$\lambda = 0$.

The first order conditions are

    $$\partial\mathcal{L}/\partial\beta = 2(X'X/n)\beta - 2X'Y/n + R'\lambda = 0$$
    $$\partial\mathcal{L}/\partial\lambda = R\beta - r = 0.$$

To solve for the estimate of the Lagrange multiplier, premultiply the first
equation by $R(X'X/n)^{-1}$ and set $R\beta = r$. This yields

    $$\ddot\lambda_n = 2(R(X'X/n)^{-1}R')^{-1}(R\hat\beta_n - r)$$
    $$\ddot\beta_n = \hat\beta_n - (X'X/n)^{-1} R' \ddot\lambda_n / 2,$$

where $\ddot\beta_n$ is the constrained least squares estimator (which automatically
satisfies $R\ddot\beta_n = r$). In this form, $\ddot\lambda_n$ is simply a nonsingular transformation
of $R\hat\beta_n - r$. This allows the following result to be proved very simply.


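Theorem 4.32 (Lagrange multiplier test) Let the conditions of
Theorem 4.25 hold and let $\mathrm{rank}(R) = q \le k$. Then under $H_o$: $R\beta_o = r$,

(i) $\Lambda_n^{-1/2} \sqrt{n}\, \ddot\lambda_n \overset{A}{\sim} N(0, I)$, where

    $$\Lambda_n \equiv 4 (R M_n^{-1} R')^{-1} \Gamma_n (R M_n^{-1} R')^{-1}$$

and $\Gamma_n$ is as defined in Theorem 4.31.

(ii) The Lagrange multiplier statistic $LM_n \equiv n \ddot\lambda_n' \hat\Lambda_n^{-1} \ddot\lambda_n \overset{A}{\sim} \chi^2_q$, where

    $$\hat\Lambda_n \equiv 4 (R(X'X/n)^{-1}R')^{-1} R (X'X/n)^{-1} \ddot V_n (X'X/n)^{-1} R' (R(X'X/n)^{-1}R')^{-1}$$

and $\ddot V_n$ is computed from the constrained regression such that
$\ddot V_n - V_n \xrightarrow{p} 0$ under $H_o$.


Proof. (i) Consider the difference

    $$\Lambda_n^{-1/2} \sqrt{n}\, \ddot\lambda_n - 2 \Lambda_n^{-1/2} (R M_n^{-1} R')^{-1} \sqrt{n}(R\hat\beta_n - r)$$
    $$= 2 \Lambda_n^{-1/2} [(R(X'X/n)^{-1}R')^{-1} - (R M_n^{-1} R')^{-1}] \Gamma_n^{1/2} \Gamma_n^{-1/2} \sqrt{n}(R\hat\beta_n - r).$$

From Theorem 4.31, $\Gamma_n^{-1/2} \sqrt{n}(R\hat\beta_n - r) \overset{A}{\sim} N(0, I)$. Because
$(X'X/n) - M_n \xrightarrow{p} 0$, it follows from Proposition 2.30 and the fact that
$\Lambda_n^{-1/2}$ and $\Gamma_n^{1/2}$ are $O(1)$ that
$\Lambda_n^{-1/2} [(R(X'X/n)^{-1}R')^{-1} - (R M_n^{-1} R')^{-1}] \Gamma_n^{1/2} \xrightarrow{p} 0$.
Hence by the product rule Lemma 4.6,

    $$\Lambda_n^{-1/2} \sqrt{n}\, \ddot\lambda_n - 2 \Lambda_n^{-1/2} (R M_n^{-1} R')^{-1} \sqrt{n}(R\hat\beta_n - r) \xrightarrow{p} 0.$$

It follows from Lemma 4.7 that $\Lambda_n^{-1/2} \sqrt{n}\, \ddot\lambda_n$ has the same asymptotic
distribution as $2 \Lambda_n^{-1/2} (R M_n^{-1} R')^{-1} \sqrt{n}(R\hat\beta_n - r)$. It follows immediately from
Corollary 4.24 that $2 \Lambda_n^{-1/2} (R M_n^{-1} R')^{-1} \sqrt{n}(R\hat\beta_n - r) \overset{A}{\sim} N(0, I)$; hence
$\Lambda_n^{-1/2} \sqrt{n}\, \ddot\lambda_n \overset{A}{\sim} N(0, I)$.

(ii) Because $\ddot V_n - V_n \xrightarrow{p} 0$ by hypothesis and $(X'X/n) - M_n \xrightarrow{p} 0$,
$\hat\Lambda_n - \Lambda_n \xrightarrow{p} 0$ by Proposition 2.30. Given the result in (i), (ii) follows from
Theorem 4.30. ∎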

Note that the Wald and Lagrange multiplier statistics would be identical
if $\hat V_n$ were used in place of $\ddot V_n$. This suggests that the two statistics should
be asymptotically equivalent.

Exercise 4.33 Prove that under the conditions of Theorems 4.31 and 4.32,
$W_n - LM_n \xrightarrow{p} 0$.

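Although the fact that $\ddot\lambda_n$ is a linear combination of $R\hat\beta_n - r$ simplifies the
proof of Theorem 4.32, the whole point of using the Lagrange multiplier
statistic is to avoid computing $\hat\beta_n$ and to compute only the simpler $\ddot\beta_n$.
Computation of $\ddot\beta_n$ is particularly simple when the data are generated as
$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ and $H_o$ specifies that $\beta_2$ (a $q \times 1$ vector) is zero.
Then

    $$R = [\,0 : I\,], \qquad r = 0,$$

where $0$ is $q \times (k - q)$, $I$ is $q \times q$, and $r$ is $q \times 1$,
and $\ddot\beta_n = (\ddot\beta_{1n}', 0)'$ where $\ddot\beta_{1n} = (X_1'X_1)^{-1} X_1'Y$.

Exercise 4.34 Define $\ddot\varepsilon = Y - X_1\ddot\beta_{1n}$. Show that under $H_o$: $\beta_2 = 0$,

    $$\ddot\lambda_n = 2 X_2'(I - X_1(X_1'X_1)^{-1}X_1')\varepsilon/n = 2 X_2'\ddot\varepsilon/n.$$

(Hint: $R\hat\beta_n - r = R(X'X/n)^{-1} X'(Y - X\ddot\beta_n)/n$.)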


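By applying the particular form of $R$ to the result of Theorem 4.32 (ii),
we obtain

    $$LM_n = n \ddot\lambda_n' \left[ (-X_2'X_1(X_1'X_1)^{-1} : I_q)\, \ddot V_n\, (-X_2'X_1(X_1'X_1)^{-1} : I_q)' \right]^{-1} \ddot\lambda_n / 4.$$

When $V_n$ can be consistently estimated by $\ddot V_n = \ddot\sigma_n^2 (X'X/n)$, where
$\ddot\sigma_n^2 = \ddot\varepsilon'\ddot\varepsilon/n$, the $LM_n$ statistic simplifies even further.


Exercise 4.35 If $\ddot\sigma_n^2 (X'X/n) - V_n \xrightarrow{p} 0$ and $\beta_2 = 0$, show that
$LM_n = n \ddot\varepsilon'X(X'X)^{-1}X'\ddot\varepsilon / (\ddot\varepsilon'\ddot\varepsilon)$, which is $n$ times the simple $R^2$ of the
regression of $\ddot\varepsilon$ on $X$.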


The result of this exercise implies a very simple procedure for testing
$\beta_2 = 0$ when $V_n = \sigma_o^2 M_n$. First, regress $Y$ on $X_1$ and form the constrained
residuals $\ddot\varepsilon$. Then regress $\ddot\varepsilon$ on $X$. The product of the sample size $n$ and the
simple $R^2$ (i.e., without adjustment for the presence of a constant in the
regression) from this regression is the $LM_n$ test statistic, which has the
$\chi^2_q$ distribution asymptotically. As Engle (1981) showed, many interesting
diagnostic statistics can be computed in this way.
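
The two-step procedure just described can be sketched as follows. The function name `lm_n_r_squared` and the simulated data are assumptions made for illustration; the "simple" $R^2$ is computed without centering, as the text specifies.

```python
import numpy as np

def lm_n_r_squared(X1, X2, Y):
    """n times the uncentered R^2 from regressing constrained residuals on X."""
    n = Y.shape[0]
    X = np.column_stack([X1, X2])
    # Step 1: regress Y on X1 only and keep the constrained residuals.
    resid_c = Y - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ Y)
    # Step 2: regress those residuals on the full regressor matrix X.
    fitted = X @ np.linalg.solve(X.T @ X, X.T @ resid_c)
    r2_uncentered = (fitted @ fitted) / (resid_c @ resid_c)
    return n * r2_uncentered, resid_c

rng = np.random.default_rng(6)
n = 400
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, 2))                           # q = 2 excluded regressors
Y = X1 @ np.array([1.0, -1.0]) + rng.standard_normal(n)    # H0 true: beta_2 = 0

LM, resid_c = lm_n_r_squared(X1, X2, Y)

# Direct evaluation of n e'X(X'X)^{-1}X'e / (e'e) for comparison.
X = np.column_stack([X1, X2])
direct = n * (resid_c @ X @ np.linalg.solve(X.T @ X, X.T @ resid_c)) / (resid_c @ resid_c)
print(LM, direct)
```

Under the stated conditions the resulting number would be compared with a $\chi^2_q$ critical value, here with $q = 2$.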

When the errors $\varepsilon_t$ are scalar i.i.d. $N(0, \sigma_o^2)$ random variables, the OLS
estimator is also the maximum likelihood estimator (MLE) because $\hat\beta_n$
solves the problem

    $$\max_{\beta, \sigma}\ \mathcal{L}(\beta, \sigma; Y) = \exp\left[ -n \log\sqrt{2\pi} - n \log\sigma - \tfrac{1}{2} \sum_{t=1}^{n} (Y_t - X_t'\beta)^2 / \sigma^2 \right],$$

where $\mathcal{L}(\beta, \sigma; Y)$ is the sample likelihood based on the normality
assumption. When $\varepsilon_t$ is not i.i.d. $N(0, \sigma_o^2)$, $\hat\beta_n$ is said to be a quasi-maximum
likelihood estimator (QMLE).

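When $\hat\beta_n$ is the MLE, hypothesis tests can be based on the log-likelihood
ratio

    $$LR_n = \log \frac{\mathcal{L}(\ddot\beta_n, \ddot\sigma_n; Y)}{\mathcal{L}(\hat\beta_n, \hat\sigma_n; Y)},$$

where $\hat\sigma_n^2 = n^{-1} \sum_{t=1}^{n} (Y_t - X_t'\hat\beta_n)^2$ as before and $\ddot\beta_n$, $\ddot\sigma_n$ solve

    $$\max_{\beta, \sigma}\ \mathcal{L}(\beta, \sigma; Y), \quad \text{s.t. } R\beta = r.$$

It is easy to show that $\ddot\beta_n$ is the constrained OLS estimator and $\ddot\sigma_n^2 = \ddot\varepsilon'\ddot\varepsilon/n$
as before. The likelihood ratio is nonnegative and always less than or equal
to 1. Simple algebra yields

    $$LR_n = (n/2) \log(\hat\sigma_n^2 / \ddot\sigma_n^2).$$

Because $\ddot\sigma_n^2 = \hat\sigma_n^2 + (\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)$ (verify this),

    $$LR_n = -(n/2) \log[1 + (\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)/\hat\sigma_n^2].$$

To find the asymptotic distribution of this statistic, we make use of the
mean value theorem of calculus.


Theorem 4.36 (Mean value theorem) Let $s : \mathbb{R}^k \to \mathbb{R}$ be defined on
an open convex set $\Theta \subset \mathbb{R}^k$ such that $s$ is continuously differentiable on $\Theta$
with $k \times 1$ gradient $\nabla s$. Then for any points $\theta$ and $\theta_o \in \Theta$ there exists $\bar\theta$ on
the segment connecting $\theta$ and $\theta_o$ such that $s(\theta) = s(\theta_o) + \nabla s(\bar\theta)'(\theta - \theta_o)$.


Proof. See Bartle (1976, p. 365). ∎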


For the present application, we choose $s(\theta) = \log(1 + \theta)$. If we also choose
$\theta_o = 0$, we have $s(\theta) = \log(1) + (1/(1 + \bar\theta))\theta = \theta/(1 + \bar\theta)$, where $\bar\theta$ lies between
$\theta$ and zero. Let $\theta_n = (\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)/\hat\sigma_n^2$, so that under $H_o$,
$|\bar\theta_n| \le |\theta_n| \xrightarrow{p} 0$; hence $\bar\theta_n \xrightarrow{p} 0$. Applying the mean value theorem now
gives

    $$LR_n = -(n/2)(1 + \bar\theta_n)^{-1} (\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)/\hat\sigma_n^2.$$

Because $(1 + \bar\theta_n)^{-1} \xrightarrow{p} 1$, it follows from Lemma 4.6 that

    $$-2LR_n - n(\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)/\hat\sigma_n^2 \xrightarrow{p} 0,$$

provided the second term has a limiting distribution. Now

    $$\hat\beta_n - \ddot\beta_n = (X'X/n)^{-1} R' (R(X'X/n)^{-1}R')^{-1} (R\hat\beta_n - r).$$

Thus

    $$-2LR_n - n(R\hat\beta_n - r)' [R(X'X/n)^{-1}R']^{-1} (R\hat\beta_n - r)/\hat\sigma_n^2 \xrightarrow{p} 0.$$


This second term is the Wald statistic formed with $\hat V_n = \hat\sigma_n^2 (X'X/n)$, so
$-2LR_n$ is asymptotically equivalent to the Wald statistic and has the $\chi^2_q$
distribution asymptotically, provided $\hat\sigma_n^2 (X'X/n)$ is a consistent estimator
for $V_n$. If this is not true, then $-2LR_n$ does not in general have the $\chi^2_q$
distribution asymptotically. It does have a limiting distribution, but not a
simple one that has been tabulated or is easily computable (see White, 1994,
Ch. 6, for further details). Note that it is not violation of the normality
assumption per se, but the failure of $V_n$ to equal $\sigma_o^2 M_n$ that results in
$-2LR_n$ not having the $\chi^2_q$ distribution asymptotically.
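
The relation between $-2LR_n$ and the Wald statistic can be checked numerically in the special case $\hat V_n = \hat\sigma_n^2 (X'X/n)$. The sketch below is an illustration under assumed simulated data, with a hypothetical helper `ssr`. It uses the identity $\ddot\sigma_n^2 = \hat\sigma_n^2 + (\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)$ from the text, so that $W_n = n(\ddot\sigma_n^2 - \hat\sigma_n^2)/\hat\sigma_n^2$, and the $nR^2$ form of the Lagrange multiplier statistic, which equals $n(\ddot\sigma_n^2 - \hat\sigma_n^2)/\ddot\sigma_n^2$.

```python
import numpy as np

def ssr(X, Y):
    """Sum of squared OLS residuals from regressing Y on X."""
    resid = Y - X @ np.linalg.solve(X.T @ X, X.T @ Y)
    return resid @ resid

rng = np.random.default_rng(7)
n = 300
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, 1))
X = np.column_stack([X1, X2])
Y = X1 @ np.array([0.5, 1.0]) + rng.standard_normal(n)   # H0 true: beta_2 = 0

sigma2_hat = ssr(X, Y) / n      # unconstrained variance estimate
sigma2_ddot = ssr(X1, Y) / n    # constrained variance estimate

LR = (n / 2) * np.log(sigma2_hat / sigma2_ddot)          # log-likelihood ratio, <= 0
minus2LR = -2.0 * LR
W = n * (sigma2_ddot - sigma2_hat) / sigma2_hat          # Wald, V_hat = sigma2_hat * X'X/n
LM = n * (sigma2_ddot - sigma2_hat) / sigma2_ddot        # n R^2 form
print(W, minus2LR, LM)
```

In this special case the three numbers are ordered $W_n \ge -2LR_n \ge LM_n$ because $x \ge \log(1+x) \ge x/(1+x)$ for $x \ge 0$, and they differ by terms that vanish in probability under $H_o$.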

The formal statement of the result for the $LR_n$ statistic is the following.


Theorem 4.37 (Likelihood ratio test) Let the conditions of Theorem
4.25 hold, let $\mathrm{rank}(R) = q \le k$, and let $\hat\sigma_n^2 (X'X/n) - V_n \xrightarrow{p} 0$. Then
under $H_o$: $R\beta_o = r$, $-2LR_n \overset{A}{\sim} \chi^2_q$.

Proof. Set $\hat V_n$ in Theorem 4.31 to $\hat V_n = \hat\sigma_n^2 (X'X/n)$. Then from the
argument preceding the theorem above $-2LR_n - W_n \xrightarrow{p} 0$. Because
$W_n \overset{A}{\sim} \chi^2_q$, it follows from Lemma 4.7 that $-2LR_n \overset{A}{\sim} \chi^2_q$. ∎


The mean value theorem just introduced provides a convenient way to
find the asymptotic distribution of statistics used to test nonlinear
hypotheses. In general, nonlinear hypotheses can be conveniently represented
as

    $$H_o: s(\beta_o) = 0,$$

where $s : \mathbb{R}^k \to \mathbb{R}^q$ is a continuously differentiable function of $\beta$.


Example 4.38 Suppose $Y = X_1\beta_1 + X_2\beta_2 + X_3\beta_3 + \varepsilon$, where $X_1$, $X_2$,
and $X_3$ are $n \times 1$ and $\beta_1$, $\beta_2$, and $\beta_3$ are scalars. Further, suppose we
hypothesize that $\beta_3 = \beta_1\beta_2$. Then $s(\beta_o) = \beta_3 - \beta_1\beta_2 = 0$ expresses the
null hypothesis.


Just as with linear restrictions, we can construct a Wald test based on the
asymptotic distribution of $s(\hat\beta_n)$; we can construct a Lagrange multiplier
test based on the Lagrange multipliers derived from minimizing the least
squares (or other estimation) objective function subject to the constraint;
or we can form a log-likelihood ratio.

To illustrate the approach, consider the Wald test based on $s(\hat\beta_n)$. As
before, a value of $s(\hat\beta_n)$ far from zero is evidence against $H_o$. To tell how far
$s(\hat\beta_n)$ must be from zero to reject $H_o$, we need to determine its asymptotic
distribution. This is provided by the next result.


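Theorem 4.39 (Wald test) Let the conditions of Theorem 4.25 hold and
let $\mathrm{rank}\, \nabla s(\beta_o) = q \le k$, where $\nabla s$ is the $k \times q$ gradient matrix of $s$. Then
under $H_o$: $s(\beta_o) = 0$,

(i) $\Gamma_n^{-1/2} \sqrt{n}\, s(\hat\beta_n) \overset{A}{\sim} N(0, I)$, where

    $$\Gamma_n \equiv \nabla s(\beta_o)' D_n \nabla s(\beta_o).$$

(ii) The Wald statistic $W_n \equiv n\, s(\hat\beta_n)' \hat\Gamma_n^{-1} s(\hat\beta_n) \overset{A}{\sim} \chi^2_q$, where

    $$\hat\Gamma_n \equiv \nabla s(\hat\beta_n)' \hat D_n \nabla s(\hat\beta_n) = \nabla s(\hat\beta_n)' (X'X/n)^{-1} \hat V_n (X'X/n)^{-1} \nabla s(\hat\beta_n).$$

Proof. (i) Because $s(\beta)$ is a vector function, we apply the mean value
theorem to each element $s_i(\beta)$, $i = 1, \ldots, q$, to get

    $$s_i(\hat\beta_n) = s_i(\beta_o) + \nabla s_i(\bar\beta_n^{(i)})'(\hat\beta_n - \beta_o),$$

where $\bar\beta_n^{(i)}$ is a $k \times 1$ vector lying on the segment connecting $\hat\beta_n$ and $\beta_o$.
The superscript $(i)$ reflects the fact that the mean value may be different
for each element $s_i(\beta)$ of $s(\beta)$.
Under $H_o$, $s_i(\beta_o) = 0$, $i = 1, \ldots, q$, so

    $$\sqrt{n}\, s_i(\hat\beta_n) = \nabla s_i(\bar\beta_n^{(i)})' \sqrt{n}(\hat\beta_n - \beta_o).$$

This suggests considering the difference

    $$\sqrt{n}\, s_i(\hat\beta_n) - \nabla s_i(\beta_o)' \sqrt{n}(\hat\beta_n - \beta_o)$$
    $$= (\nabla s_i(\bar\beta_n^{(i)}) - \nabla s_i(\beta_o))' \sqrt{n}(\hat\beta_n - \beta_o)$$
    $$= (\nabla s_i(\bar\beta_n^{(i)}) - \nabla s_i(\beta_o))' D_n^{1/2} D_n^{-1/2} \sqrt{n}(\hat\beta_n - \beta_o).$$

By Theorem 4.25, $D_n^{-1/2} \sqrt{n}(\hat\beta_n - \beta_o) \overset{A}{\sim} N(0, I)$. Because $\hat\beta_n \xrightarrow{p} \beta_o$, it
follows that $\bar\beta_n^{(i)} \xrightarrow{p} \beta_o$, so $\nabla s_i(\bar\beta_n^{(i)}) - \nabla s_i(\beta_o) \xrightarrow{p} 0$ by Proposition 2.27.
Because $D_n^{1/2}$ is $O(1)$, we have $(\nabla s_i(\bar\beta_n^{(i)}) - \nabla s_i(\beta_o))' D_n^{1/2} \xrightarrow{p} 0$. It follows
from Lemma 4.6 that

    $$\sqrt{n}\, s_i(\hat\beta_n) - \nabla s_i(\beta_o)' \sqrt{n}(\hat\beta_n - \beta_o) \xrightarrow{p} 0, \quad i = 1, \ldots, q.$$

In vector form this becomes

    $$\sqrt{n}\, s(\hat\beta_n) - \nabla s(\beta_o)' \sqrt{n}(\hat\beta_n - \beta_o) \xrightarrow{p} 0,$$

and because $\Gamma_n^{-1/2}$ is $O(1)$,

    $$\Gamma_n^{-1/2} \sqrt{n}\, s(\hat\beta_n) - \Gamma_n^{-1/2} \nabla s(\beta_o)' \sqrt{n}(\hat\beta_n - \beta_o) \xrightarrow{p} 0.$$

Corollary 4.24 immediately yields

    $$\Gamma_n^{-1/2} \nabla s(\beta_o)' \sqrt{n}(\hat\beta_n - \beta_o) \overset{A}{\sim} N(0, I),$$

so by Lemma 4.7, $\Gamma_n^{-1/2} \sqrt{n}\, s(\hat\beta_n) \overset{A}{\sim} N(0, I)$.

(ii) Because $\hat D_n - D_n \xrightarrow{p} 0$ from Theorem 4.25 and
$\nabla s(\hat\beta_n) - \nabla s(\beta_o) \xrightarrow{p} 0$ by Proposition 2.27, $\hat\Gamma_n - \Gamma_n \xrightarrow{p} 0$ by Proposition 2.30.
Given the result in (i), (ii) follows from Theorem 4.30. ∎

Note the similarity of this result to Theorem 4.31, which gives the Wald
test for the linear hypothesis $R\beta_o = r$. In the present context, $s(\beta_o)$ plays
the role of $R\beta_o - r$, whereas $\nabla s(\beta_o)'$ plays the role of $R$ in computing the
covariance matrix.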


Exercise 4.40 Write down the Wald statistic for testing the hypothesis of 
Example 4.38. 


Exercise 4.41 Give the Lagrange multiplier statistic for testing the
hypothesis $H_o$: $s(\beta_o) = 0$ versus $H_a$: $s(\beta_o) \ne 0$, and derive its limiting
distribution under the conditions of Theorem 4.25.


Exercise 4.42 Give the Wald and Lagrange multiplier statistics for testing
the hypotheses $R\beta_o = r$ and $s(\beta_o) = 0$ on the basis of the IV estimator
$\tilde\beta_n$ and derive their limiting distributions under the conditions of Exercise
4.26.


4.3 Asymptotic Efficiency 


Given a class of estimators (e.g., the class of instrumental variables
estimators), it is desirable to choose that member of the class that has the
smallest asymptotic covariance matrix (assuming that this member exists
and can be computed). The reason for this is that such estimators are
obviously more precise, and in general allow construction of more powerful
test statistics. In what follows, we shall abuse notation slightly and write
$\mathrm{avar}(\tilde\beta_n)$ instead of $\mathrm{avar}(\sqrt{n}(\tilde\beta_n - \beta_o))$. A definition of asymptotic
efficiency adequate for our purposes here is the following.


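Definition 4.43 Let $\mathcal{P}$ be a set of data generating processes such that
each data generating process $P^o \in \mathcal{P}$ has a corresponding coefficient
vector $\beta^o \in \mathbb{R}^k$, and let $\mathcal{E}$ be a class of estimators $\{\tilde\beta_n\}$ such that for each
$P^o \in \mathcal{P}$, $(D_n^o)^{-1/2}\sqrt{n}(\tilde\beta_n - \beta^o) \xrightarrow{d^o} N(0, I)$ for $\{\mathrm{avar}^o(\tilde\beta_n) \equiv D_n^o\}$
nonstochastic, $O(1)$ and uniformly nonsingular. Then $\{\beta_n^*\} \in \mathcal{E}$ is
asymptotically efficient relative to $\{\tilde\beta_n\} \in \mathcal{E}$ for $\mathcal{P}$ if for every $P^o \in \mathcal{P}$ the matrix
$\mathrm{avar}^o(\tilde\beta_n) - \mathrm{avar}^o(\beta_n^*)$ is positive semidefinite for all $n$ sufficiently large.
The estimator $\{\beta_n^*\} \in \mathcal{E}$ is asymptotically efficient in $\mathcal{E}$ for $\mathcal{P}$ if it is
asymptotically efficient relative to every other member of its class, $\mathcal{E}$, for $\mathcal{P}$.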


We write $\beta^o$ instead of $\beta_o$ to emphasize that $\beta^o$ corresponds to (at least)
one possible data generating process ($P^o$) among many ($\mathcal{P}$).

The class of estimators we consider is the class $\mathcal{E}$ of instrumental variables
estimators
$$\tilde\beta_n = (X'Z\hat P_n Z'X)^{-1}X'Z\hat P_n Z'Y,$$
where different members of the class are defined by the different possible
choices for $\hat P_n$ and $Z$. For any data generating process $P^o$ satisfying the
conditions of Exercise 4.26, we have that the asymptotic covariance matrix
of $\tilde\beta_n$ is
$$D_n = (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}.$$
We leave the dependence of $D_n$ (and other quantities) on $P^o$ implicit for
now, and begin our analysis by considering how to choose $P_n$ to make $D_n$
as small as possible.

Until now, we have let $P_n$ be any positive definite matrix. It turns out,
however, that by choosing $P_n = V_n^{-1}$ one obtains an asymptotically
efficient estimator for the class of IV estimators with given instrumental
variables $Z$. To prove this, we make use of the following proposition.


Proposition 4.44 Let $A$ and $B$ be positive definite matrices of order $k$.
Then $A - B$ is positive semidefinite if and only if $B^{-1} - A^{-1}$ is positive
semidefinite.


Proof. This follows from Goldberger (1964, Theorem 1.7.21, p. 38). ∎


This result is useful because in the cases of interest to us it will often be
much easier to determine whether $B^{-1} - A^{-1}$ is positive semidefinite than
to examine the positive semidefiniteness of $A - B$ directly.


Proposition 4.45 Given instrumental variables $Z$, let $\mathcal{P}$ be the collection
of probability measures satisfying the conditions of Exercise 4.26. Then the
choice $\hat P_n = \hat V_n^{-1}$ gives the IV estimator
$$\beta_n^* = (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y,$$
which is asymptotically efficient for $\mathcal{P}$ within the class $\mathcal{E}$ of instrumental
variables estimators of the form
$$\tilde\beta_n = (X'Z\hat P_nZ'X)^{-1}X'Z\hat P_nZ'Y.$$

Proof. From Exercise 4.26, we have
$$\mathrm{avar}(\beta_n^*) = (Q_n'V_n^{-1}Q_n)^{-1}.$$
From Proposition 4.44, $\mathrm{avar}(\tilde\beta_n) - \mathrm{avar}(\beta_n^*)$ is p.s.d. if and only if
$$(\mathrm{avar}(\beta_n^*))^{-1} - (\mathrm{avar}(\tilde\beta_n))^{-1}$$
is p.s.d. Now for all $n$ sufficiently large,
$$\begin{aligned}
(\mathrm{avar}(\beta_n^*))^{-1} - (\mathrm{avar}(\tilde\beta_n))^{-1}
&= Q_n'V_n^{-1}Q_n - Q_n'P_nQ_n(Q_n'P_nV_nP_nQ_n)^{-1}Q_n'P_nQ_n \\
&= Q_n'V_n^{-1/2}\bigl(I - V_n^{1/2}P_nQ_n(Q_n'P_nV_n^{1/2}V_n^{1/2}P_nQ_n)^{-1}Q_n'P_nV_n^{1/2}\bigr)V_n^{-1/2}Q_n \\
&= Q_n'V_n^{-1/2}\bigl(I - G_n(G_n'G_n)^{-1}G_n'\bigr)V_n^{-1/2}Q_n,
\end{aligned}$$
where $G_n \equiv V_n^{1/2}P_nQ_n$. This is a quadratic form in an idempotent matrix
and is therefore p.s.d. As this holds for all $P^o$ in $\mathcal{P}$, the result follows. ∎
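
The inequality in Proposition 4.45 can be checked numerically. The following is a minimal sketch, assuming randomly generated stand-ins for $Q_n$, $V_n$ and an arbitrary positive definite $P_n$ (the sizes, the seed and the helper name `iv_avar` are illustrative choices, not part of the text). It evaluates the sandwich formula for $D_n$ and confirms that the difference from the $P_n = V_n^{-1}$ case has no negative eigenvalues.

```python
import numpy as np

def iv_avar(Q, P, V):
    """Sandwich covariance (Q'PQ)^{-1} Q'P V P Q (Q'PQ)^{-1} of an IV estimator."""
    bread = np.linalg.inv(Q.T @ P @ Q)
    return bread @ Q.T @ P @ V @ P @ Q @ bread

rng = np.random.default_rng(0)
l, k = 4, 2                          # l instruments, k coefficients (illustrative)
Q = rng.normal(size=(l, k))          # stand-in for Q_n = E(Z'X/n)
A = rng.normal(size=(l, l))
V = A @ A.T + l * np.eye(l)          # positive definite stand-in for V_n
B = rng.normal(size=(l, l))
P = B @ B.T + l * np.eye(l)          # an arbitrary positive definite weighting matrix

D_any = iv_avar(Q, P, V)                     # generic choice of P_n
D_eff = iv_avar(Q, np.linalg.inv(V), V)      # efficient choice P_n = V_n^{-1}

gap = D_any - D_eff
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()
print("smallest eigenvalue of D_any - D_eff:", min_eig)
```

With $P_n = V_n^{-1}$ the sandwich collapses to $(Q_n'V_n^{-1}Q_n)^{-1}$, which is the expression used in the proof.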


Hansen (1982) considers estimators for coefficients $\beta_o$ implicitly defined
by moment conditions of the form $E(g(X_t, Y_t, Z_t, \beta_o)) = 0$, where $g$ is
an $l \times 1$ vector-valued function. Analogous to our construction of the IV
estimator, Hansen's estimator is constructed by attempting to make the
sample moment
$$n^{-1}\sum_{t=1}^n g(X_t, Y_t, Z_t, \beta)$$
as close to zero as possible by solving the problem
$$\min_\beta \Bigl[n^{-1}\sum_{t=1}^n g(X_t, Y_t, Z_t, \beta)\Bigr]'\hat P_n\Bigl[n^{-1}\sum_{t=1}^n g(X_t, Y_t, Z_t, \beta)\Bigr].$$
Hansen establishes that the resulting "method of moments" estimator is
consistent and asymptotically normal and that the choice $\hat P_n = \hat V_n^{-1}$,
$\hat V_n - V_n \xrightarrow{p} 0$, where
$$V_n = \mathrm{var}\Bigl(n^{-1/2}\sum_{t=1}^n g(X_t, Y_t, Z_t, \beta_o)\Bigr),$$
delivers the asymptotically efficient estimator in the class of method of
moments estimators. Hansen calls the method of moments estimator with
$\hat P_n = \hat V_n^{-1}$ the generalized method of moments (GMM) estimator.

Thus, the estimator $\beta_n^*$ of Proposition 4.45 is the GMM estimator for
the case in which
$$g(X_t, Y_t, Z_t, \beta_o) = Z_t(Y_t - X_t'\beta_o).$$
We call $\beta_n^*$ a "linear" GMM estimator because the defining moment
condition $E(Z_t(Y_t - X_t'\beta_o)) = 0$ is linear in $\beta_o$.


Exercise 4.46 Given instrumental variables $X$, suppose that
$$V_n^{-1/2}n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t \xrightarrow{d} N(0, I),$$
where $V_n = \sigma_o^2M_n$. Show that the asymptotically efficient IV estimator is
the least squares estimator, $\hat\beta_n$, according to Proposition 4.45.


Exercise 4.47 Given instrumental variables $Z$, suppose that
$$V_n^{-1/2}n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t \xrightarrow{d} N(0, I),$$
where $V_n = \sigma_o^2L_n$ and $L_n = E(Z'Z/n)$. Show that the asymptotically
efficient IV estimator is the two-stage least squares estimator
$$\tilde\beta_{2SLS} = (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$$
according to Proposition 4.45.
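
The identity behind Exercise 4.47 can be illustrated on simulated data. This is a sketch only: the data generating process, sample size and the helper name `iv_estimator` are assumptions made for the illustration. It computes the IV estimator with $\hat P_n = (Z'Z/n)^{-1}$ and compares it with the literal two-stage procedure (regress $X$ on $Z$, then $Y$ on the fitted values).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, l = 500, 2, 3
Z = rng.normal(size=(n, l))                       # instruments
u = rng.normal(size=n)                            # common shock creating endogeneity
X = Z @ rng.normal(size=(l, k)) + np.outer(u, [1.0, 0.5]) + rng.normal(size=(n, k))
beta_o = np.array([1.0, -2.0])                    # illustrative true coefficients
Y = X @ beta_o + u + rng.normal(size=n)

def iv_estimator(X, Z, Y, P):
    """(X'Z P Z'X)^{-1} X'Z P Z'Y for a given weighting matrix P."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ P @ XZ.T, XZ @ P @ (Z.T @ Y))

# IV estimator with P_n = (Z'Z/n)^{-1}
b_iv = iv_estimator(X, Z, Y, np.linalg.inv(Z.T @ Z / n))

# Literal two-stage least squares
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # first stage fitted values
b_2sls = np.linalg.lstsq(X_hat, Y, rcond=None)[0] # second stage

print(b_iv, b_2sls)
```

The two vectors agree up to rounding, because the factor $n$ in $\hat P_n$ cancels between the two occurrences of $\hat P_n$ in the IV formula.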


Note that the value of $\sigma_o^2$ plays no role in either Exercise 4.46 or Exercise
4.47 as long as it is finite. In what follows, we shall simply ignore $\sigma_o^2$, and
proceed as if $\sigma_o^2 = 1$.

If instead of testing the restrictions $s(\beta_o) = 0$, as considered in the
previous section, we believe or know these restrictions to be true (because,
for example, they are dictated by economic theory), then, as we discuss
next, we can further improve asymptotic efficiency by imposing these
restrictions. (Of course, for this to really work, the restrictions must in fact
be true.)

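Thus, suppose we are given constraints $s(\beta_o) = 0$, where $s: \mathbb{R}^k \to \mathbb{R}^q$ is
a known continuously differentiable function, such that $\mathrm{rank}(\nabla s(\beta_o)) = q$
and $\nabla s(\beta_o)$ ($k \times q$) is finite. The constrained instrumental variables
estimator can be found as the solution to the problem
$$\min_\beta\, (Y - X\beta)'Z\hat P_nZ'(Y - X\beta), \quad \text{s.t. } s(\beta) = 0,$$
which is equivalent to finding the saddle point of the Lagrangian
$$\mathcal{L} = (Y - X\beta)'Z\hat P_nZ'(Y - X\beta) + s(\beta)'\lambda.$$
The first-order conditions are
$$\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\beta} &= 2(X'Z\hat P_nZ'X)\beta - 2X'Z\hat P_nZ'Y + \nabla s(\beta)\lambda = 0, \\
\frac{\partial\mathcal{L}}{\partial\lambda} &= s(\beta) = 0.
\end{aligned}$$
Setting $\tilde\beta_n = (X'Z\hat P_nZ'X)^{-1}X'Z\hat P_nZ'Y$ and taking a mean value expansion
of $s(\beta)$ around $\tilde\beta_n$ yields the equations
$$\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\beta} &= 2(X'Z\hat P_nZ'X)(\beta - \tilde\beta_n) + \nabla s(\beta)\lambda = 0, \\
\frac{\partial\mathcal{L}}{\partial\lambda} &= s(\tilde\beta_n) + \nabla\bar s'(\beta - \tilde\beta_n) = 0,
\end{aligned}$$
where $\nabla\bar s'$ is the $q \times k$ Jacobian matrix with $i$th row evaluated at a mean
value $\bar\beta_n^{(i)}$. To solve for $\lambda$ in the first equation, premultiply by
$\nabla\bar s'(X'Z\hat P_nZ'X)^{-1}$ to get
$$2\nabla\bar s'(\beta - \tilde\beta_n) + \nabla\bar s'(X'Z\hat P_nZ'X)^{-1}\nabla s(\beta)\lambda = 0,$$
substitute $-s(\tilde\beta_n) = \nabla\bar s'(\beta - \tilde\beta_n)$, and invert $\nabla\bar s'(X'Z\hat P_nZ'X)^{-1}\nabla s(\beta)$ to
obtain
$$\lambda = 2[\nabla\bar s'(X'Z\hat P_nZ'X)^{-1}\nabla s(\beta)]^{-1}s(\tilde\beta_n).$$
The expression for $\partial\mathcal{L}/\partial\beta$ above yields
$$\beta - \tilde\beta_n = -(X'Z\hat P_nZ'X)^{-1}\nabla s(\beta)\lambda/2,$$
so we obtain the solution for $\beta$ by substituting for $\lambda$:
$$\beta = \tilde\beta_n - (X'Z\hat P_nZ'X)^{-1}\nabla s(\beta)[\nabla\bar s'(X'Z\hat P_nZ'X)^{-1}\nabla s(\beta)]^{-1}s(\tilde\beta_n).$$

The difficulty with this solution is that it is not in closed form, because the
unknown $\beta$ appears on both sides of the equation. Further, appearing in
this expression is $\nabla\bar s$, which has $q$ rows each of which depends on a mean
value lying between $\beta$ and $\tilde\beta_n$.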

Nevertheless, a computationally practical and asymptotically equivalent
result can be obtained by replacing $\nabla\bar s$ and $\nabla s(\beta)$ by $\nabla s(\tilde\beta_n)$ on the
right-hand side of the expression above, which yields
$$\beta_n^* = \tilde\beta_n - (X'Z\hat P_nZ'X)^{-1}\nabla s(\tilde\beta_n)[\nabla s(\tilde\beta_n)'(X'Z\hat P_nZ'X)^{-1}\nabla s(\tilde\beta_n)]^{-1}s(\tilde\beta_n).$$
This gives us a convenient way of computing a constrained IV estimator.
First we compute the unconstrained estimator, and then we "impose" the
constraints by subtracting a "correction factor"
$$(X'Z\hat P_nZ'X)^{-1}\nabla s(\tilde\beta_n)[\nabla s(\tilde\beta_n)'(X'Z\hat P_nZ'X)^{-1}\nabla s(\tilde\beta_n)]^{-1}s(\tilde\beta_n)$$
from the unconstrained estimator. We say "impose" because $\beta_n^*$ will not
satisfy the constraints exactly for any finite $n$. However, an estimator that
does satisfy the constraints to any desired degree of accuracy can be
obtained by iterating the procedure just described, that is, by replacing $\tilde\beta_n$
by $\beta_n^*$ in the formula above to get a second round estimator, say, $\beta_n^{**}$.
This process could continue until the change in the resulting estimator was
sufficiently small. Nevertheless, this iteration process has no effect on the
asymptotic covariance matrix of the resulting estimator.
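
The two-step recipe (unconstrained estimate, then correction factor, then optional iteration) can be sketched as follows. Everything specific here is an assumption for illustration: the simulated data, the constraint $s(\beta) = \beta_1\beta_2 - 1$ (so $k = 2$, $q = 1$), the choice $\hat P_n = (Z'Z/n)^{-1}$, and the helper names `s`, `grad_s` and `impose`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
Z = rng.normal(size=(n, 3))
X = Z @ np.array([[1.0, 0.2], [0.3, 1.0], [0.5, 0.5]]) + rng.normal(size=(n, 2))
beta_o = np.array([2.0, 0.5])                  # satisfies s(beta_o) = 0
Y = X @ beta_o + rng.normal(size=n)

P = np.linalg.inv(Z.T @ Z / n)                 # one admissible weighting matrix
XZ = X.T @ Z
M = XZ @ P @ XZ.T                              # X'Z P Z'X
b_tilde = np.linalg.solve(M, XZ @ P @ (Z.T @ Y))   # unconstrained IV estimator

def s(b):
    """Hypothetical constraint function, R^2 -> R^1."""
    return np.array([b[0] * b[1] - 1.0])

def grad_s(b):
    """Gradient of s, arranged as a k x q matrix."""
    return np.array([[b[1]], [b[0]]])

def impose(b):
    """Subtract the correction factor evaluated at b."""
    G = grad_s(b)
    MinvG = np.linalg.solve(M, G)
    return b - MinvG @ np.linalg.solve(G.T @ MinvG, s(b))

b_star = impose(b_tilde)                       # one-step constrained estimator
b_iter = b_star
for _ in range(5):                             # further rounds of the same formula
    b_iter = impose(b_iter)

print(abs(s(b_tilde)[0]), abs(s(b_star)[0]), abs(s(b_iter)[0]))
```

The printed constraint violations shrink from one round to the next, which is the sense in which iteration satisfies the constraints "to any desired degree of accuracy".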


Exercise 4.48 Define
$$\beta_n^{**} = \beta_n^* - (X'Z\hat P_nZ'X)^{-1}\nabla s(\beta_n^*)[\nabla s(\beta_n^*)'(X'Z\hat P_nZ'X)^{-1}\nabla s(\beta_n^*)]^{-1}s(\beta_n^*).$$
Show that, under the conditions of Exercise 4.26 and the conditions on $s$,
$\sqrt{n}(\beta_n^{**} - \beta_n^*) \xrightarrow{p} 0$, so that $\sqrt{n}(\beta_n^* - \beta_o)$ has the same asymptotic
distribution as $\sqrt{n}(\beta_n^{**} - \beta_o)$. (Hint: Show that $\sqrt{n}\,s(\beta_n^*) \xrightarrow{p} 0$.)

Thus, in considering the effect of imposing the restriction $s(\beta_o) = 0$, we
can just compare the asymptotic behavior of $\beta_n^*$ to $\tilde\beta_n$.

As we saw in Proposition 4.45, the asymptotically efficient IV estimator
takes $\hat P_n = \hat V_n^{-1}$. We therefore consider the effect of imposing constraints
on the IV estimator with $\hat P_n = \hat V_n^{-1}$.


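Theorem 4.49 Suppose the conditions of Exercise 4.26 hold for $\hat P_n = \hat V_n^{-1}$
and that $s: \mathbb{R}^k \to \mathbb{R}^q$ is a continuously differentiable function such
that $s(\beta_o) = 0$, $\nabla s(\beta_o)$ is finite and $\mathrm{rank}(\nabla s(\beta_o)) = q$. Define
$\tilde\beta_n = (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y$ and define $\beta_n^*$ as the constrained IV
estimator above with $\hat P_n = \hat V_n^{-1}$. Then
$$\mathrm{avar}(\tilde\beta_n) - \mathrm{avar}(\beta_n^*) = \mathrm{avar}(\tilde\beta_n)\nabla s(\beta_o)[\nabla s(\beta_o)'\mathrm{avar}(\tilde\beta_n)\nabla s(\beta_o)]^{-1}\nabla s(\beta_o)'\mathrm{avar}(\tilde\beta_n),$$
which is a positive semidefinite matrix.

Proof. From Exercise 4.26 it follows that
$$\mathrm{avar}(\tilde\beta_n) = (Q_n'V_n^{-1}Q_n)^{-1}.$$
Taking a mean value expansion of $s(\tilde\beta_n)$ around $\beta_o$ gives $s(\tilde\beta_n) = s(\beta_o) +
\nabla\bar s'(\tilde\beta_n - \beta_o)$, and because $s(\beta_o) = 0$, we have $s(\tilde\beta_n) = \nabla\bar s'(\tilde\beta_n - \beta_o)$.
Substituting this in the formula for $\beta_n^*$ allows us to write
$$\sqrt{n}(\beta_n^* - \beta_o) = \tilde A_n\sqrt{n}(\tilde\beta_n - \beta_o),$$
where
$$\tilde A_n = I - (X'Z\hat V_n^{-1}Z'X)^{-1}\nabla s(\tilde\beta_n)[\nabla s(\tilde\beta_n)'(X'Z\hat V_n^{-1}Z'X)^{-1}\nabla s(\tilde\beta_n)]^{-1}\nabla\bar s'.$$
Under the conditions of Exercise 4.26, $\tilde\beta_n \xrightarrow{p} \beta_o$ and Proposition 2.30
applies to ensure that $\tilde A_n - A_n \xrightarrow{p} 0$, where
$$A_n = I - \mathrm{avar}(\tilde\beta_n)\nabla s(\beta_o)[\nabla s(\beta_o)'\mathrm{avar}(\tilde\beta_n)\nabla s(\beta_o)]^{-1}\nabla s(\beta_o)'.$$
Hence, by Lemma 4.6,
$$\sqrt{n}(\beta_n^* - \beta_o) - A_n\sqrt{n}(\tilde\beta_n - \beta_o) = (\tilde A_n - A_n)\sqrt{n}(\tilde\beta_n - \beta_o) \xrightarrow{p} 0,$$
because $\sqrt{n}(\tilde\beta_n - \beta_o)$ is $O_p(1)$ as a consequence of Exercise 4.26. From the
asymptotic equivalence Lemma 4.7 it follows that $\sqrt{n}(\beta_n^* - \beta_o)$ has the
same asymptotic distribution as $A_n\sqrt{n}(\tilde\beta_n - \beta_o)$. It follows from Lemma
4.23 that $\sqrt{n}(\beta_n^* - \beta_o)$ is asymptotically normal with mean zero and
$$\mathrm{avar}(\beta_n^*) = A_n\,\mathrm{avar}(\tilde\beta_n)A_n'.$$
Straightforward algebra yields
$$\mathrm{avar}(\beta_n^*) = \mathrm{avar}(\tilde\beta_n) - \mathrm{avar}(\tilde\beta_n)\nabla s(\beta_o)[\nabla s(\beta_o)'\mathrm{avar}(\tilde\beta_n)\nabla s(\beta_o)]^{-1}\nabla s(\beta_o)'\mathrm{avar}(\tilde\beta_n),$$
and the result follows immediately. ∎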


This result guarantees that imposing correct a priori restrictions leads
to an efficiency improvement over the efficient IV estimator that does not
impose these restrictions. Interestingly, imposing the restrictions using the
formula for $\beta_n^*$ with an inefficient estimator $\tilde\beta_n$ for given instrumental
variables may or may not lead to efficiency gains relative to $\tilde\beta_n$.

A feature of this result worth noting is that the asymptotic distribution
of the constrained estimator $\beta_n^*$ is concentrated in a $(k - q)$-dimensional
subspace of $\mathbb{R}^k$, so that $\mathrm{avar}(\beta_n^*)$ is singular: it has
rank $k - q$. In particular, observe that
$$\sqrt{n}(\beta_n^* - \beta_o) = \sqrt{n}A_n(\tilde\beta_n - \beta_o) + o_p(1).$$
But
$$\begin{aligned}
\nabla s(\beta_o)'A_n
&= \nabla s(\beta_o)'\bigl(I - \mathrm{avar}(\tilde\beta_n)\nabla s(\beta_o)[\nabla s(\beta_o)'\mathrm{avar}(\tilde\beta_n)\nabla s(\beta_o)]^{-1}\nabla s(\beta_o)'\bigr) \\
&= \nabla s(\beta_o)' - \nabla s(\beta_o)' \\
&= 0.
\end{aligned}$$
Consequently, we have
$$\mathrm{avar}(\sqrt{n}\,\nabla s(\beta_o)'(\beta_n^* - \beta_o)) = 0.$$
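
The rank deficiency is easy to see numerically. A minimal sketch, assuming a random positive definite matrix `D` standing in for $\mathrm{avar}(\tilde\beta_n)$ and a random $k \times q$ matrix `G` standing in for $\nabla s(\beta_o)$ (both are illustrative stand-ins, as are the dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
k, q = 4, 2
B = rng.normal(size=(k, k))
D = B @ B.T + k * np.eye(k)            # stand-in for avar of the unconstrained estimator
G = rng.normal(size=(k, q))            # stand-in for the gradient of s at beta_o

# A_n = I - D G (G' D G)^{-1} G'
A = np.eye(k) - D @ G @ np.linalg.solve(G.T @ D @ G, G.T)
avar_star = A @ D @ A.T                # avar of the constrained estimator

# Closed form from Theorem 4.49
closed_form = D - D @ G @ np.linalg.solve(G.T @ D @ G, G.T @ D)

rank = np.linalg.matrix_rank(avar_star, tol=1e-8)
print("rank:", rank, " max |G'A|:", np.abs(G.T @ A).max())
```

The product `G.T @ A` vanishes up to rounding and the computed rank equals $k - q$, matching the argument above.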


An alternative way to improve the asymptotic efficiency of the IV es- 
timator is to include additional valid instrumental variables. To establish 
this, we make use of the formula for the inverse of a partitioned matrix, 
which we now state for convenience. 


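Proposition 4.50 Define the $k \times k$ nonsingular symmetric matrix
$$A = \begin{bmatrix} B & C' \\ C & D \end{bmatrix},$$
where $B$ is $k_1 \times k_1$, $C$ is $k_2 \times k_1$ and $D$ is $k_2 \times k_2$. Then, defining
$E = D - CB^{-1}C'$,
$$A^{-1} = \begin{bmatrix} B^{-1}(I + C'E^{-1}CB^{-1}) & -B^{-1}C'E^{-1} \\ -E^{-1}CB^{-1} & E^{-1} \end{bmatrix}.$$

Proof. See Goldberger (1964, p. 27). ∎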
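
As a quick sanity check on the block formula, the following sketch assembles $A^{-1}$ block by block for a randomly generated symmetric positive definite matrix and compares it with a direct inverse. The block sizes and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
k1, k2 = 3, 2
R = rng.normal(size=(k1 + k2, k1 + k2))
A = R @ R.T + np.eye(k1 + k2)          # symmetric, nonsingular

B = A[:k1, :k1]                        # k1 x k1
C = A[k1:, :k1]                        # k2 x k1 (upper-right block is C')
D = A[k1:, k1:]                        # k2 x k2

Binv = np.linalg.inv(B)
E = D - C @ Binv @ C.T                 # Schur complement of B
Einv = np.linalg.inv(E)

top = np.hstack([Binv @ (np.eye(k1) + C.T @ Einv @ C @ Binv), -Binv @ C.T @ Einv])
bottom = np.hstack([-Einv @ C @ Binv, Einv])
A_inv_blocks = np.vstack([top, bottom])

print(np.abs(A_inv_blocks - np.linalg.inv(A)).max())
```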


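Proposition 4.51 Partition $Z$ as $Z = (Z_1, Z_2)$ and let $\mathcal{P}$ be the collection
of all probability measures such that the conditions of Exercise 4.26 hold
for both $Z_1$ and $Z_2$. Define $V_{1n} = E(Z_1'\varepsilon\varepsilon'Z_1/n)$, and the estimators
$$\begin{aligned}
\tilde\beta_n &= (X'Z_1\hat V_{1n}^{-1}Z_1'X)^{-1}X'Z_1\hat V_{1n}^{-1}Z_1'Y, \\
\beta_n^* &= (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y.
\end{aligned}$$
Then for each $P^o$ in $\mathcal{P}$, $\mathrm{avar}(\tilde\beta_n) - \mathrm{avar}(\beta_n^*)$ is a positive semidefinite matrix
for all $n$ sufficiently large.

Proof. Partition $Q_n$ as $Q_n' = (Q_{1n}', Q_{2n}')$, where $Q_{1n} = E(Z_1'X/n)$, $Q_{2n} =
E(Z_2'X/n)$, and partition $V_n$ as
$$V_n = \begin{bmatrix} V_{1n} & V_{12n} \\ V_{21n} & V_{2n} \end{bmatrix}.$$
The partitioned inverse formula gives
$$V_n^{-1} = \begin{bmatrix} V_{1n}^{-1}(I + V_{12n}E_n^{-1}V_{21n}V_{1n}^{-1}) & -V_{1n}^{-1}V_{12n}E_n^{-1} \\ -E_n^{-1}V_{21n}V_{1n}^{-1} & E_n^{-1} \end{bmatrix},$$
where $E_n = V_{2n} - V_{21n}V_{1n}^{-1}V_{12n}$. From Exercise 4.26, we have
$$\begin{aligned}
\mathrm{avar}(\tilde\beta_n) &= (Q_{1n}'V_{1n}^{-1}Q_{1n})^{-1}, \\
\mathrm{avar}(\beta_n^*) &= (Q_n'V_n^{-1}Q_n)^{-1}.
\end{aligned}$$
We apply Proposition 4.44 and consider
$$\begin{aligned}
(\mathrm{avar}(\beta_n^*))^{-1} - (\mathrm{avar}(\tilde\beta_n))^{-1}
&= Q_n'V_n^{-1}Q_n - Q_{1n}'V_{1n}^{-1}Q_{1n} \\
&= Q_{1n}'(V_{1n}^{-1} + V_{1n}^{-1}V_{12n}E_n^{-1}V_{21n}V_{1n}^{-1})Q_{1n} \\
&\quad - Q_{2n}'E_n^{-1}V_{21n}V_{1n}^{-1}Q_{1n} - Q_{1n}'V_{1n}^{-1}V_{12n}E_n^{-1}Q_{2n} \\
&\quad + Q_{2n}'E_n^{-1}Q_{2n} - Q_{1n}'V_{1n}^{-1}Q_{1n} \\
&= (Q_{1n}'V_{1n}^{-1}V_{12n} - Q_{2n}')E_n^{-1}(V_{21n}V_{1n}^{-1}Q_{1n} - Q_{2n}).
\end{aligned}$$
Because $E_n^{-1}$ is a symmetric positive definite matrix (why?), we can write
$E_n^{-1} = E_n^{-1/2}E_n^{-1/2}$ so that
$$(\mathrm{avar}(\beta_n^*))^{-1} - (\mathrm{avar}(\tilde\beta_n))^{-1} = (Q_{1n}'V_{1n}^{-1}V_{12n} - Q_{2n}')E_n^{-1/2}E_n^{-1/2}(V_{21n}V_{1n}^{-1}Q_{1n} - Q_{2n}).$$
Because this is the product of a matrix and its transpose, we immediately
have that $(\mathrm{avar}(\beta_n^*))^{-1} - (\mathrm{avar}(\tilde\beta_n))^{-1}$ is p.s.d., which by Proposition 4.44
implies that $\mathrm{avar}(\tilde\beta_n) - \mathrm{avar}(\beta_n^*)$ is p.s.d. As this holds for all $P^o$ in $\mathcal{P}$, the
result follows. ∎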

This result states essentially that the asymptotic precision of the IV
estimator cannot be worsened by including additional instrumental
variables. We can be more specific, however, and specify situations in which
$\mathrm{avar}(\beta_n^*) = \mathrm{avar}(\tilde\beta_n)$, so that nothing is gained by adding an extra
instrumental variable.

Proposition 4.52 Let the conditions of Proposition 4.51 hold. Then we
have $\mathrm{avar}(\tilde\beta_n) = \mathrm{avar}(\beta_n^*)$ if and only if
$$E(X'Z_1/n)E(Z_1'\varepsilon\varepsilon'Z_1/n)^{-1}E(Z_1'\varepsilon\varepsilon'Z_2/n) - E(X'Z_2/n) = 0.$$

Proof. Immediate from the final line of the proof of Proposition 4.51. ∎

To interpret this condition, consider the special case in which
$$E(Z'\varepsilon\varepsilon'Z/n) = E(Z'Z/n).$$
In this case the difference in Proposition 4.52 can be consistently estimated
by
$$n^{-1}(X'Z_1(Z_1'Z_1)^{-1}Z_1'Z_2 - X'Z_2) = n^{-1}X'[Z_1(Z_1'Z_1)^{-1}Z_1' - I]Z_2.$$
This quantity is recognizable as the cross product of $Z_2$ and the projection
of $X$ onto the space orthogonal to that spanned by the columns of $Z_1$. If
we write $\tilde X = (Z_1(Z_1'Z_1)^{-1}Z_1' - I)X$, the difference in Proposition 4.52 is
consistently estimated by $\tilde X'Z_2/n$, so that $\mathrm{avar}(\tilde\beta_n) = \mathrm{avar}(\beta_n^*)$ if and only
if $\tilde X'Z_2/n \xrightarrow{p} 0$, which can be interpreted as saying that adding $Z_2$ to the
list of instrumental variables is of no use if it is uncorrelated with $\tilde X$, the
matrix of residuals (up to sign) of the regression of $X$ on $Z_1$.
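
Propositions 4.51 and 4.52 can both be illustrated with population-level stand-ins. In this sketch the matrices `V` and `Q` are random placeholders for $V_n$ and $Q_n$, and the helper name `avar` is ours. The first comparison shows that adding $Z_2$ cannot increase the asymptotic covariance; the second constructs $Q_{2n}$ so that the condition of Proposition 4.52 holds exactly, in which case the gain vanishes.

```python
import numpy as np

rng = np.random.default_rng(5)
l1, l2, k = 3, 2, 2
R = rng.normal(size=(l1 + l2, l1 + l2))
V = R @ R.T + np.eye(l1 + l2)          # stand-in for V_n (all instruments)
Q = rng.normal(size=(l1 + l2, k))      # stand-in for Q_n
V1, V12 = V[:l1, :l1], V[:l1, l1:]     # blocks belonging to Z_1
Q1 = Q[:l1]

def avar(Q, V):
    """Efficient-IV asymptotic covariance (Q' V^{-1} Q)^{-1}."""
    return np.linalg.inv(Q.T @ np.linalg.solve(V, Q))

# Proposition 4.51: the gap is positive semidefinite
gap = avar(Q1, V1) - avar(Q, V)
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()

# Proposition 4.52: choose Q_2 = V_21 V_1^{-1} Q_1, so the extra instruments add nothing
Q_red = np.vstack([Q1, V12.T @ np.linalg.solve(V1, Q1)])
gap_red = avar(Q1, V1) - avar(Q_red, V)

print(min_eig, np.abs(gap_red).max())
```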

One of the interesting consequences of Propositions 4.51 and 4.52 is that
in the presence of heteroskedasticity or serial correlation of unknown form,
there may exist estimators for the linear model more efficient than OLS.
This result has been obtained independently by Cragg (1983) and
Chamberlain (1982). To construct these estimators, it is necessary to find
additional instrumental variables uncorrelated with $\varepsilon_t$. If $E(\varepsilon_t|X_t) = 0$, such
instrumental variables are easily found because any measurable function
of $X_t$ will be uncorrelated with $\varepsilon_t$. Hence, we can set $Z_t = (X_t', z(X_t)')'$,
where $z(X_t)$ is an $(l - k) \times 1$ vector of measurable functions of $X_t$.


Example 4.53 Let $p = k = 1$ so $Y_t = X_t\beta_o + \varepsilon_t$, where $Y_t$, $X_t$ and $\varepsilon_t$
are scalars. Suppose that $X_t$ is nonstochastic, and for convenience suppose
$M_n \equiv n^{-1}\sum_{t=1}^n X_t^2 \to 1$. Let $\varepsilon_t$ be independent heterogeneously distributed
such that $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma_t^2$. Further, suppose $X_t > \delta > 0$
for all $t$, and take $z(X_t) = X_t^{-1}$ so that $Z_t = (X_t, X_t^{-1})'$. We consider
$\hat\beta_n = (X'X)^{-1}X'Y$ and $\beta_n^* = (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y$, and suppose
that sufficient other assumptions guarantee that the result of Exercise 4.26
holds for both estimators. By Propositions 4.51 and 4.52, it follows that
$\mathrm{avar}(\hat\beta_n) > \mathrm{avar}(\beta_n^*)$ if and only if
$$\Bigl(n^{-1}\sum_{t=1}^n \sigma_t^2X_t^2\Bigr)^{-1}\Bigl(n^{-1}\sum_{t=1}^n \sigma_t^2\Bigr) - 1 \neq 0,$$
or equivalently, if and only if $n^{-1}\sum_{t=1}^n \sigma_t^2X_t^2 \neq n^{-1}\sum_{t=1}^n \sigma_t^2$. This would
certainly occur if $\sigma_t^2 = X_t^{-1}$. (Verify this using Jensen's inequality.)

It also follows from Propositions 4.51 and 4.52 that when $V_n \neq L_n$, there
may exist estimators more efficient than two-stage least squares. If $l > k$,
additional instrumental variables are not necessarily required to improve
efficiency over 2SLS (see White, 1982); but as the result of Proposition
4.51 indicates, additional instrumental variables (e.g., functions of $Z_t$) can
nevertheless generate further improvements.
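
The comparison in Example 4.53 can be worked through with concrete numbers. This sketch assumes a deterministic grid for $X_t$ on an interval bounded away from zero, rescaled so that $M_n = 1$, together with $\sigma_t^2 = X_t^{-1}$; the grid and sample size are illustrative choices. It evaluates the two sides of the condition and the two asymptotic variances directly from their formulas.

```python
import numpy as np

n = 1000
X = np.linspace(0.5, 1.5, n)                 # nonstochastic regressor, bounded away from 0
X = X / np.sqrt(np.mean(X**2))               # normalize so that M_n = 1
sig2 = 1.0 / X                               # heteroskedasticity: sigma_t^2 = 1 / X_t

# Condition from the example: the two averages must differ
lhs = np.mean(sig2 * X**2)                   # n^{-1} sum sigma_t^2 X_t^2
rhs = np.mean(sig2)                          # n^{-1} sum sigma_t^2

# OLS: instruments X only, avar = M^{-1} V_1 M^{-1}
M = np.mean(X**2)
avar_ols = lhs / M**2

# Efficient IV with Z_t = (X_t, 1/X_t)'
Z = np.column_stack([X, 1.0 / X])
V = (Z * sig2[:, None]).T @ Z / n            # n^{-1} sum sigma_t^2 Z_t Z_t'
Q = Z.T @ X / n                              # n^{-1} sum Z_t X_t
avar_star = 1.0 / (Q @ np.linalg.solve(V, Q))

print(lhs, rhs, avar_ols, avar_star)
```

Here `lhs` is the average of $X_t$ and `rhs` the average of $X_t^{-1}$, which is where Jensen's inequality enters.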

The results so far suggest that in particular situations efficiency may or 
may not be improved by using additional instrumental variables. This leads 
us to ask whether there exists an optimal set of instrumental variables, that 
is, a set of instrumental variables that one could not improve upon by using 
additional instrumental variables. If so, we could in principle definitively 
know whether or not a given set of instrumental variables is best. 

To address this question, we introduce an approach developed by Bates
and White (1993) that permits direct construction of the optimal
instrumental variables. This is in sharp contrast to the approach taken when
we considered the best choice for $\hat P_n$. There, we seemingly pulled the best
choice, $\hat V_n^{-1}$, out of a hat (as if by magic!), and simply verified its optimality.
There was no hint of how we arrived at this choice. (This is a common
feature of the efficiency literature, a feature that has endowed it with no
little mystery.) With the Bates-White method, however, we will be able
to see how to construct the optimal (most efficient) instrumental variables
estimators, step-by-step.

Bates and White consider classes of estimators $\mathcal{E}$ indexed by a parameter
$\gamma$ such that for each data generating process $P^o$ in a set $\mathcal{P}$,
$$\sqrt{n}(\hat\theta_n(\gamma) - \theta^o) = H_n^o(\gamma)^{-1}\sqrt{n}\,s_n^o(\gamma) + o_{p^o}(1), \qquad (4.1)$$
where: $\hat\theta_n(\gamma)$ is an estimator indexed by $\gamma$ taking values in a set $\Gamma$; $\theta^o =
\theta(P^o)$ is the $k \times 1$ parameter vector corresponding to the data generating
process $P^o$; $H_n^o(\gamma)$ is a nonstochastic nonsingular $k \times k$ matrix depending
on $\gamma$ and $P^o$; $s_n^o(\gamma)$ is a random $k \times 1$ vector depending on $\gamma$ and $P^o$ such
that $I_n^o(\gamma)^{-1/2}\sqrt{n}\,s_n^o(\gamma) \xrightarrow{d^o} N(0, I)$, where $I_n^o(\gamma) = \mathrm{var}^o(\sqrt{n}\,s_n^o(\gamma))$, $\xrightarrow{d^o}$
denotes convergence in distribution under $P^o$, and $\mathrm{var}^o$ denotes variance
under $P^o$; and $o_{p^o}(1)$ denotes terms vanishing in probability-$P^o$. Bates and
White consider how to choose $\gamma$ in $\Gamma$ to obtain an asymptotically efficient
estimator.

Bates and White’s method is actually formulated for a more general 
setting but (4.1) will suffice for us here. 

To relate the Bates-White method to our instrumental variables
estimation framework, we begin by writing the process generating the dependent
variables in a way that makes explicit the role of $P^o$. Specifically, we write
$$Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t, \quad t = 1, 2, \ldots.$$
The rationale for this is as follows. The data generating process $P^o$ governs
the probabilistic behavior of all of the random variables (measurable
mappings from the underlying measurable space $(\Omega, \mathcal{F})$) in our system. Given
the way our data are generated, the probabilistic behavior of the mapping
$\varepsilon_t$ is governed by $P^o$, but the mapping itself is not determined by $P^o$. We
thus do not give a $^o$ superscript to $\varepsilon_t$.

On the other hand, different values $\beta^o$ (corresponding to $\theta^o$ in the
Bates-White setting) result in different mappings for the dependent variables; we
acknowledge this by writing $Y_t^o$, and we view $\beta^o$ as determined by $P^o$,
i.e., $\beta^o = \beta(P^o)$. Further, because lags of $Y_t^o$ or elements of $Y_t^o$ itself
(recall the simultaneous equations framework) may appear as elements of
the explanatory variable matrix, different values for $\beta^o$ may also result in
different mappings for the explanatory variables. We acknowledge this by
writing $X_t^o$.

Next, consider the instrumental variables. These also are measurable
mappings from $(\Omega, \mathcal{F})$ whose probabilistic behavior is governed by $P^o$. As
we will see, it is useful to permit these mappings to depend on $P^o$ as well.
Consequently, we write $Z_t^o$. Because our goal is the choice of optimal
instrumental variables, we let $Z_t^o$ depend on $\gamma$ and write $Z_t^o(\gamma)$ to emphasize this
dependence. The optimal choice of $\gamma$ will then deliver the optimal choice of
instrumental variables. Later, we will revisit and extend this interpretation
of $\gamma$.

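In Exercise 4.26 (i), we assume $V_n^{-1/2}n^{-1/2}Z'\varepsilon \xrightarrow{d} N(0, I)$, where
$V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon)$. For the Bates-White framework with $Z_t^o(\gamma)$, this
becomes $V_n^o(\gamma)^{-1/2}n^{-1/2}Z^o(\gamma)'\varepsilon \xrightarrow{d^o} N(0, I)$, where $V_n^o(\gamma) = \mathrm{var}^o(n^{-1/2}Z^o(\gamma)'\varepsilon)$,
and $Z^o(\gamma)$ is the $np \times l$ matrix with blocks $Z_t^o(\gamma)'$. In (iii.a) we
assume $Z'X/n - Q_n = o_p(1)$, where $Q_n = E(Z'X/n)$ is $O(1)$ with
uniformly full column rank. This becomes $Z^o(\gamma)'X^o/n - Q_n^o(\gamma) = o_{p^o}(1)$, where
$Q_n^o(\gamma) = E^o(Z^o(\gamma)'X^o/n)$ is $O(1)$ with uniformly full column rank; $E^o(\cdot)$
denotes expectation with respect to $P^o$. In (iii.b) we assume $\hat P_n - P_n =
o_p(1)$, where $P_n$ is $O(1)$ and uniformly positive definite. This becomes
$\hat P_n(\gamma) - P_n^o(\gamma) = o_{p^o}(1)$, where $P_n^o(\gamma)$ is $O(1)$ and uniformly positive
definite. Finally, the instrumental variables estimator has the form
$$\tilde\beta_n(\gamma) = (X^{o\prime}Z^o(\gamma)\hat P_n(\gamma)Z^o(\gamma)'X^o)^{-1}X^{o\prime}Z^o(\gamma)\hat P_n(\gamma)Z^o(\gamma)'Y^o.$$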


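We will use the following further extension of Exercise 4.26.

Theorem 4.54 Let $\mathcal{P}$ be a set of probability measures $P^o$ on $(\Omega, \mathcal{F})$, and
let $\Gamma$ be a set with elements $\gamma$. Suppose that for each $P^o$ in $\mathcal{P}$

(i) $Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta^o \in \mathbb{R}^k$.

For each $\gamma$ in $\Gamma$ and each $P^o$ in $\mathcal{P}$, let $\{Z_{nt}^o(\gamma)\}$ be a double array of
random $p \times l$ matrices, such that

(ii) $n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t - n^{1/2}m_n^o(\gamma) = o_{p^o}(1)$, where
$$V_n^o(\gamma)^{-1/2}n^{1/2}m_n^o(\gamma) \xrightarrow{d^o} N(0, I),$$
and $V_n^o(\gamma) = \mathrm{var}^o(n^{1/2}m_n^o(\gamma))$ is $O(1)$ and uniformly positive
definite;

(iii) (a) $n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)X_t^{o\prime} - Q_n^o(\gamma) = o_{p^o}(1)$, where $Q_n^o(\gamma)$ is $O(1)$ with
uniformly full column rank;
(b) There exists $\hat P_n(\gamma)$ such that $\hat P_n(\gamma) - P_n^o(\gamma) = o_{p^o}(1)$ and $P_n^o(\gamma) =
O(1)$ and is symmetric and uniformly positive definite.

Then for each $P^o$ in $\mathcal{P}$ and $\gamma$ in $\Gamma$
$$D_n^o(\gamma)^{-1/2}\sqrt{n}(\tilde\beta_n(\gamma) - \beta^o) \xrightarrow{d^o} N(0, I),$$
where
$$D_n^o(\gamma) = (Q_n^o(\gamma)'P_n^o(\gamma)Q_n^o(\gamma))^{-1}Q_n^o(\gamma)'P_n^o(\gamma)V_n^o(\gamma)P_n^o(\gamma)Q_n^o(\gamma)(Q_n^o(\gamma)'P_n^o(\gamma)Q_n^o(\gamma))^{-1}$$
and $D_n^o(\gamma)^{-1}$ are $O(1)$.

Proof. Apply the proof of Exercise 4.26 for each $P^o$ in $\mathcal{P}$ and $\gamma$ in $\Gamma$. ∎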


This result is stated with features that permit fairly general applications,
discussed further next. In particular, we let $\{Z_{nt}^o(\gamma)\}$ be a double array
of matrices. In some cases a single array (sequence) $\{Z_t^o(\gamma)\}$ suffices. In
such cases we can simply put $m_n^o(\gamma) = n^{-1}Z^o(\gamma)'\varepsilon$, so that $V_n^o(\gamma) =
\mathrm{var}^o(n^{-1/2}Z^o(\gamma)'\varepsilon)$, parallel to Exercise 4.26. In this case we also have
$Q_n^o(\gamma) = E^o(Z^o(\gamma)'X^o/n)$. These identifications will suffice for now, but
the additional flexibility permitted in Theorem 4.54 will become useful
shortly. Recall that the proof of Exercise 4.26 gives
$$\sqrt{n}(\tilde\beta_n(\gamma) - \beta^o) = (Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma))^{-1}Q_n^o(\gamma)'V_n^o(\gamma)^{-1}n^{-1/2}Z^o(\gamma)'\varepsilon + o_{p^o}(1)$$
when we choose $\hat P_n(\gamma)$ so that $P_n^o(\gamma) = V_n^o(\gamma)^{-1}$, the asymptotically
efficient choice. Comparing this expression to (4.1) we see that $\tilde\beta_n(\gamma)$
corresponds to $\hat\theta_n(\gamma)$, $\mathcal{E}$ is the set of estimators $\mathcal{E} = \{\{\tilde\beta_n(\gamma)\}, \gamma \in \Gamma\}$, $\beta^o$
corresponds to $\theta^o$, and we can put
$$\begin{aligned}
H_n^o(\gamma) &= Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma), \\
s_n^o(\gamma) &= Q_n^o(\gamma)'V_n^o(\gamma)^{-1}n^{-1}Z^o(\gamma)'\varepsilon, \\
I_n^o(\gamma) &= H_n^o(\gamma).
\end{aligned}$$
Our consideration of the construction of the optimal instrumental variables
estimator makes essential use of these correspondences.

The key to applying the Bates-White constructive method here is the 
following result, a version of Theorem 2.6 of Bates and White (1993). 


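Proposition 4.55 Let $\hat\theta_n(\gamma)$ satisfy (4.1) for all $\gamma$ in a nonempty set $\Gamma$
and all $P^o$ in $\mathcal{P}$, and suppose there exists $\gamma^*$ in $\Gamma$ such that
$$H_n^o(\gamma) = \mathrm{cov}^o(\sqrt{n}\,s_n^o(\gamma), \sqrt{n}\,s_n^o(\gamma^*)) \qquad (4.2)$$
for all $\gamma \in \Gamma$ and all $P^o$ in $\mathcal{P}$. Then for all $\gamma \in \Gamma$ and all $P^o$ in $\mathcal{P}$
$$\mathrm{avar}^o(\hat\theta_n(\gamma)) - \mathrm{avar}^o(\hat\theta_n(\gamma^*))$$
is positive semidefinite for all $n$ sufficiently large.

Proof. Given (4.1), we have that for all $\gamma$ in $\Gamma$, all $P^o$ in $\mathcal{P}$, and all $n$
sufficiently large
$$\begin{aligned}
\mathrm{avar}^o(\hat\theta_n(\gamma)) &= H_n^o(\gamma)^{-1}I_n^o(\gamma)H_n^o(\gamma)^{-1\prime}, \\
\mathrm{avar}^o(\hat\theta_n(\gamma^*)) &= H_n^o(\gamma^*)^{-1}I_n^o(\gamma^*)H_n^o(\gamma^*)^{-1\prime} = H_n^o(\gamma^*)^{-1},
\end{aligned}$$
where the last equality holds because (4.2) with $\gamma = \gamma^*$ gives $H_n^o(\gamma^*) = I_n^o(\gamma^*)$,
so
$$\mathrm{avar}^o(\hat\theta_n(\gamma)) - \mathrm{avar}^o(\hat\theta_n(\gamma^*)) = H_n^o(\gamma)^{-1}I_n^o(\gamma)H_n^o(\gamma)^{-1\prime} - H_n^o(\gamma^*)^{-1}.$$
We now show that this quantity equals
$$\mathrm{var}^o\bigl(H_n^o(\gamma)^{-1}\sqrt{n}\,s_n^o(\gamma) - H_n^o(\gamma^*)^{-1}\sqrt{n}\,s_n^o(\gamma^*)\bigr).$$
Note that (4.2) implies
$$\mathrm{var}^o\begin{pmatrix} \sqrt{n}\,s_n^o(\gamma) \\ \sqrt{n}\,s_n^o(\gamma^*) \end{pmatrix} = \begin{pmatrix} I_n^o(\gamma) & H_n^o(\gamma) \\ H_n^o(\gamma)' & H_n^o(\gamma^*) \end{pmatrix}.$$
So the variance of $H_n^o(\gamma)^{-1}\sqrt{n}\,s_n^o(\gamma) - H_n^o(\gamma^*)^{-1}\sqrt{n}\,s_n^o(\gamma^*)$ is given by
$$\begin{aligned}
&H_n^o(\gamma)^{-1}I_n^o(\gamma)H_n^o(\gamma)^{-1\prime} + H_n^o(\gamma^*)^{-1}H_n^o(\gamma^*)H_n^o(\gamma^*)^{-1\prime} \\
&\quad - H_n^o(\gamma)^{-1}H_n^o(\gamma)H_n^o(\gamma^*)^{-1\prime} - H_n^o(\gamma^*)^{-1}H_n^o(\gamma)'H_n^o(\gamma)^{-1\prime} \\
&= H_n^o(\gamma)^{-1}I_n^o(\gamma)H_n^o(\gamma)^{-1\prime} - H_n^o(\gamma^*)^{-1},
\end{aligned}$$
as claimed. Since this quantity is a variance-covariance matrix it is positive
semidefinite, and the proof is complete. ∎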


The key to finding the asymptotically efficient estimator in a given class
is thus to find $\gamma^*$ such that for all $P^o$ in $\mathcal{P}$ and $\gamma$ in $\Gamma$
$$H_n^o(\gamma) = \mathrm{cov}^o(\sqrt{n}\,s_n^o(\gamma), \sqrt{n}\,s_n^o(\gamma^*)).$$


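Finding the optimal instrumental variables turns out to be somewhat
involved, but key insight can be gained by initially considering $Z_t^o(\gamma)$ as
previously suggested and by requiring that the errors $\{\varepsilon_t\}$ be sufficiently
well behaved. Thus, consider applying (4.2) with $Z_t^o(\gamma)$. The
correspondences previously identified give
$$Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma) = \mathrm{cov}^o\bigl(Q_n^o(\gamma)'V_n^o(\gamma)^{-1}n^{-1/2}Z^o(\gamma)'\varepsilon,\; Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}n^{-1/2}Z^o(\gamma^*)'\varepsilon\bigr).$$
Defining $C_n^o(\gamma, \gamma^*) = \mathrm{cov}^o(n^{-1/2}Z^o(\gamma)'\varepsilon, n^{-1/2}Z^o(\gamma^*)'\varepsilon)$, we have the
simpler expression
$$Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}C_n^o(\gamma, \gamma^*)V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*).$$
It therefore suffices to find $\gamma^*$ such that for all $\gamma$
$$Q_n^o(\gamma) = C_n^o(\gamma, \gamma^*)V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*).$$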

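To make this choice tractable, we impose plausible structure on $\varepsilon_t$.
Specifically, we suppose there is an increasing sequence of $\sigma$-fields $\{\mathcal{F}_t\}$
adapted to $\{\varepsilon_t\}$ such that for each $P^o$ in $\mathcal{P}$, $\{\varepsilon_t, \mathcal{F}_t\}$ is a martingale
difference sequence. That is, $E^o(\varepsilon_t|\mathcal{F}_{t-1}) = 0$. Recall that the instrumental
variables $\{Z_t^o\}$ must be uncorrelated with $\varepsilon_t$. It is therefore natural to
restrict attention to instrumental variables $Z_t^o(\gamma)$ that are measurable-$\mathcal{F}_{t-1}$,
as then
$$E^o(Z_t^o(\gamma)\varepsilon_t) = E^o(E^o(Z_t^o(\gamma)\varepsilon_t|\mathcal{F}_{t-1})) = E^o(Z_t^o(\gamma)E^o(\varepsilon_t|\mathcal{F}_{t-1})) = 0.$$
Imposing these requirements on $\{\varepsilon_t\}$ and $\{Z_t^o(\gamma)\}$, we have that
$$\begin{aligned}
C_n^o(\gamma, \gamma^*) &= \mathrm{cov}^o(n^{-1/2}Z^o(\gamma)'\varepsilon, n^{-1/2}Z^o(\gamma^*)'\varepsilon) \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)\varepsilon_t\varepsilon_t'Z_t^o(\gamma^*)') \\
&= n^{-1}\sum_{t=1}^n E^o(E^o(Z_t^o(\gamma)\varepsilon_t\varepsilon_t'Z_t^o(\gamma^*)'|\mathcal{F}_{t-1})) \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})Z_t^o(\gamma^*)') \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)\Omega_t^oZ_t^o(\gamma^*)'),
\end{aligned}$$
where $\Omega_t^o = E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})$. The first equality is by definition, the second
follows from the martingale difference property of $\varepsilon_t$, the third uses the law
of iterated expectations, and the fourth applies Proposition 3.63.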
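Consequently, we seek $\gamma^*$ such that
$$n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)X_t^{o\prime}) = n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)\Omega_t^oZ_t^o(\gamma^*)')V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*),$$
or, by the law of iterated expectations,
$$\begin{aligned}
n^{-1}\sum_{t=1}^n E^o(E^o(Z_t^o(\gamma)X_t^{o\prime}|\mathcal{F}_{t-1}))
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)E^o(X_t^{o\prime}|\mathcal{F}_{t-1})) \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)\bar X_t^{o\prime}) \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)\Omega_t^oZ_t^o(\gamma^*)')V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*),
\end{aligned}$$
where we define $\bar X_t^o = E^o(X_t^o|\mathcal{F}_{t-1})$.
Inspecting the last equation, we see that equality holds provided we
choose $\gamma^*$ such that
$$Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}Z_t^o(\gamma^*)\Omega_t^o = \bar X_t^o. \qquad (4.3)$$
If $\Omega_t^o = E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})$ is nonsingular, as is plausible, this becomes
$$Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}Z_t^o(\gamma^*) = \bar X_t^o\Omega_t^{o-1},$$
which looks promising, as both sides are measurable-$\mathcal{F}_{t-1}$. We must
therefore choose $Z_t^o(\gamma^*)$ so that a particular linear combination of $Z_t^o(\gamma^*)$ equals
$\bar X_t^o\Omega_t^{o-1}$.
This still looks challenging, however, because $\gamma^*$ also appears in $Q_n^o(\gamma^*)$
and $V_n^o(\gamma^*)$. Nevertheless, consider the simplest possible choice for $Z_t^o(\gamma^*)$,
$$Z_t^o(\gamma^*) = \bar X_t^o\Omega_t^{o-1}.$$
For this to be valid, it must happen that $Q_n^o(\gamma^*)' = V_n^o(\gamma^*)$, so that
$Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1} = I$. Now with this choice for $Z_t^o(\gamma^*)$
$$\begin{aligned}
V_n^o(\gamma^*) &= \mathrm{var}^o\Bigl(n^{-1/2}\sum_{t=1}^n Z_t^o(\gamma^*)\varepsilon_t\Bigr) \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma^*)\varepsilon_t\varepsilon_t'Z_t^o(\gamma^*)') \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma^*)\Omega_t^oZ_t^o(\gamma^*)') \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma^*)\Omega_t^o\Omega_t^{o-1}\bar X_t^{o\prime}) \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma^*)\bar X_t^{o\prime}) \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma^*)E^o(X_t^{o\prime}|\mathcal{F}_{t-1})) \\
&= n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma^*)X_t^{o\prime}) \\
&= Q_n^o(\gamma^*).
\end{aligned}$$
The choice $Z_t^o(\gamma^*) = \bar X_t^o\Omega_t^{o-1}$ thus satisfies the sufficient condition (4.2),
and therefore delivers the optimal instrumental variables.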


Nevertheless, using these optimal instrumental variables presents apparent
obstacles in that neither $\Omega_t^o = E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})$ nor $\bar X_t^o = E^o(X_t^o|\mathcal{F}_{t-1})$
are necessarily known or observable. We shall see that these obstacles can
be overcome under reasonable assumptions. Before dealing with the general
case, however, we consider two important special cases in which these
obstacles are removed. This provides insight useful in dealing with the general
case, and leads us naturally to procedures having general applicability.

Thus, suppose for the moment
$$E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1}) = \Omega_t, \quad t = 1, 2, \ldots,$$
where $\Omega_t$ is a known $p \times p$ matrix. (We consider the consequences of not
knowing $\Omega_t$ later.) Although this restricts $\mathcal{P}$, it does not restrict it
completely: different choices for $P^o$ can imply different distributions for $\varepsilon_t$,
even though $\Omega_t$ is identical.

If we then also require that $X_t^o$ is measurable-$\mathcal{F}_{t-1}$ it follows that
$$\bar X_t^o = E^o(X_t^o|\mathcal{F}_{t-1}) = X_t^o,$$
so that $\bar X_t^o = X_t^o$ is observable. Note that this also implies
$$E^o(Y_t^o|\mathcal{F}_{t-1}) = E^o(X_t^{o\prime}\beta^o + \varepsilon_t|\mathcal{F}_{t-1}) = X_t^{o\prime}\beta^o,$$
so that the conditional mean of $Y_t^o$ still depends on $P^o$ through $\beta^o$, but we
can view (2') as defining a relation useful for prediction. This requirement
rules out a system of simultaneous equations, but still permits $X_t^o$ to contain
lags of elements of $Y_t^o$.

With these restrictions on $\mathcal{P}$ we have that $Z_t^o(\gamma^*) = \bar X_t^o\Omega_t^{o-1} = X_t^o\Omega_t^{-1}$
delivers a usable set of optimal instrumental variables. Observe that this
clearly illustrates the way in which the existence of an optimal IV estimator
depends on the collection $\mathcal{P}$ of candidate data generating processes.

Let us now examine the estimator resulting from choosing the optimal 
instrumental variables in this setting. 


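Exercise 4.56 Let $X_t^o$ be $\mathcal{F}_{t-1}$-measurable and suppose that $\Omega_t = E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})$
for all $P^o$ in $\mathcal{P}$. Show that with the choice $Z_t^o(\gamma^*) = X_t^o\Omega_t^{-1}$ then
$V_n^o(\gamma^*) = E^o(n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime})$ and show that with $\hat V_n(\gamma^*) = n^{-1}\sum_{t=1}^n
X_t^o\Omega_t^{-1}X_t^{o\prime}$ this yields the GLS estimator
$$\begin{aligned}
\beta_n^* &= \Bigl(n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime}\Bigr)^{-1}n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}Y_t^o \\
&= (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y,
\end{aligned}$$
where $\Omega$ is the $np \times np$ block diagonal matrix with $p \times p$ diagonal blocks $\Omega_t$,
$t = 1, \ldots, n$ and zeroes elsewhere, and we drop the $^o$ superscript in writing
$X$ and $Y$.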
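
The equivalence asserted in Exercise 4.56 can be checked numerically in the scalar case $p = 1$. This is a sketch under stated assumptions: the data are simulated, the "known" conditional variances `omega` are an arbitrary positive function of the regressor, and the rows of the array `X` hold $X_t'$. It forms the instruments $Z_t = X_t\Omega_t^{-1}$, plugs them into the generic IV formula with $\hat P_n = \hat V_n(\gamma^*)^{-1}$, and compares the result with the textbook GLS formula.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # rows are X_t'
omega = np.exp(X[:, 1])                                  # known conditional variances (p = 1)
beta_o = np.array([1.0, 2.0])                            # illustrative true coefficients
Y = X @ beta_o + np.sqrt(omega) * rng.normal(size=n)

# Optimal instruments Z_t = X_t * Omega_t^{-1}
Z = X / omega[:, None]
V_hat = Z.T @ X / n                                      # n^{-1} sum X_t Omega_t^{-1} X_t'
P = np.linalg.inv(V_hat)

# Generic IV formula (X'Z P Z'X)^{-1} X'Z P Z'Y
XZ = X.T @ Z
b_iv = np.linalg.solve(XZ @ P @ XZ.T, XZ @ P @ (Z.T @ Y))

# Direct GLS formula (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y
Omega_inv = np.diag(1.0 / omega)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ Y)

print(b_iv, b_gls)
```

Because the number of instruments equals the number of coefficients here, the weighting matrix cancels and the IV estimator reduces to $(Z'X)^{-1}Z'Y$, which is exactly the GLS expression.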


This is the generalized least squares estimator whose finite sample
efficiency properties we discussed in Chapter 1. Our current discussion shows
that these finite sample properties have large sample analogs in a setting
that neither requires normality for the errors $\varepsilon_t$ nor requires that the
regressors $X_t^o$ be nonstochastic.

In fact, we have now proven part (a) of the following asymptotic efficiency 
result for the generalized least squares estimator. 


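Theorem 4.57 (Efficiency of GLS) Suppose the conditions of Theorem
4.54 hold and that in addition $\mathcal{P}$ is such that for each $P^o$ in $\mathcal{P}$

(i) $Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta^o \in \mathbb{R}^k$,
where $\{(X_{t+1}^o, \varepsilon_t), \mathcal{F}_t\}$ is an adapted stochastic sequence with
$E^o(\varepsilon_t|\mathcal{F}_{t-1}) = 0$, $t = 1, 2, \ldots$; and

(ii) $E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1}) = \Omega_t$ is a known finite and nonsingular $p \times p$ matrix,
$t = 1, 2, \ldots$;

(iii) For each $\gamma \in \Gamma$, $Z_t^o(\gamma)$ is $\mathcal{F}_{t-1}$-measurable, $t = 1, 2, \ldots$, and $Q_n^o(\gamma) =
n^{-1}\sum_{t=1}^n E^o(Z_t^o(\gamma)X_t^{o\prime})$.

Then
(a) $Z_t^o(\gamma^*) = X_t^o\Omega_t^{-1}$ delivers
$$\beta_n^* = \Bigl(n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime}\Bigr)^{-1}n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}Y_t^o,$$
which is asymptotically efficient in $\mathcal{E}$ for $\mathcal{P}$.

(b) Further, the asymptotic covariance matrix of $\beta_n^*$ is given by
$$\mathrm{avar}^o(\beta_n^*) = \Bigl[n^{-1}\sum_{t=1}^n E^o(X_t^o\Omega_t^{-1}X_t^{o\prime})\Bigr]^{-1},$$
which is consistently estimated by
$$\hat D_n = \Bigl[n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime}\Bigr]^{-1},$$
i.e., $\hat D_n - \mathrm{avar}^o(\beta_n^*) \xrightarrow{p^o} 0$.

We leave the proof of (b) as an exercise.

Exercise 4.58 Prove Theorem 4.57 (b).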


As we remarked in Chapter 1, the GLS estimator typically is not feasible
because the conditional covariances $\{\Omega_t\}$ are usually unknown.
Nevertheless, we might be willing to model $\Omega_t$ based on an assumed data generating
process for $\varepsilon_t\varepsilon_t'$, say
$$E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1}) = h(W_t^o, \alpha^o),$$
where $h$ is a known function mapping $\mathcal{F}_{t-1}$-measurable random variables
$W_t^o$ and unknown parameters $\alpha^o$ into $p \times p$ matrices. Note that different
$P^o$'s can now be accommodated by different values for $\alpha^o$, and that $W_t^o$
may depend on $P^o$.

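An example is the ARCH($q$) data generating process of Engle (1982) in
which $h(W_t^o, \alpha^o) = \alpha_0^o + \alpha_1^o\varepsilon_{t-1}^2 + \cdots + \alpha_q^o\varepsilon_{t-q}^2$ for the scalar ($p = 1$) case.
Here, $W_t^o = (\varepsilon_{t-1}^2, \ldots, \varepsilon_{t-q}^2)$. The popular GARCH(1,1) DGP of Bollerslev
(1986) can be similarly treated by viewing it as an ARCH($\infty$) DGP, in
which case $W_t^o = (\varepsilon_{t-1}^2, \varepsilon_{t-2}^2, \ldots)$.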
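
To make the ARCH($q$) specification concrete, the following is a minimal sketch of the function $h$ in the scalar case. The function name `arch_h`, the array layout, and the numerical values are illustrative assumptions, not part of the text.

```python
import numpy as np

def arch_h(eps, alpha):
    """Evaluate h(W_t, alpha) = alpha[0] + sum_j alpha[j] * eps[t-j]**2.

    eps   : 1-D array holding eps_1, ..., eps_n
    alpha : (alpha_0, alpha_1, ..., alpha_q), illustrative parameter values
    Returns h for dates t = q+1, ..., n (earlier dates lack a full lag vector).
    """
    eps = np.asarray(eps, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    q = alpha.size - 1
    sq = eps ** 2
    h = np.full(eps.size - q, alpha[0])
    for j in range(1, q + 1):
        # squared disturbance lagged j periods, aligned with dates q+1, ..., n
        h += alpha[j] * sq[q - j: eps.size - j]
    return h

eps = np.array([1.0, -2.0, 0.5, 3.0])
h = arch_h(eps, alpha=[0.1, 0.5, 0.2])   # ARCH(2) with made-up coefficients
```

Here `h[0]` is the conditional variance for $t = 3$, built from $\varepsilon_2^2$ and $\varepsilon_1^2$, and `h[1]` is the one for $t = 4$.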

If we can obtain an estimator, say $\hat\alpha_n$, consistent for $\alpha^o$ regardless of
which $P^o$ in $\mathcal{P}$ generated the data, then we can hope that replacing the
unknown $\Omega_t^o = h(W_t^o, \alpha^o)$ with the estimator $\hat\Omega_{nt} = h(W_t^o, \hat\alpha_n)$ might
still deliver an efficient estimator of the form

$$\beta_n^* = \Big(n^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big)^{-1} n^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}Y_t^o.$$

Such estimators are called “feasible” GLS (FGLS) estimators, because they
can feasibly be computed whereas, in the absence of knowledge of $\Omega_t^o$, GLS
cannot.
Inspecting the FGLS estimator, we see that it is also an IV estimator
with

$$Z_{nt}^o = X_t^o\hat\Omega_{nt}^{-1}, \qquad \hat P_n = \Big[n^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big]^{-1}.$$

Observe that $Z_{nt}^o = X_t^o\hat\Omega_{nt}^{-1}$ is now double subscripted to make explicit its
dependence on both $n$ and $t$. This double subscripting is explicitly allowed
for in Theorem 4.54, and we can now see the usefulness of the flexibility
this affords. To accommodate this choice as well as other similar choices,
we now elaborate our specifications of $Z_{nt}^o$ and $\gamma$. Specifically, we put


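$$Z_{nt}^o(\gamma) = Z_{t2}^o(\gamma_2)\,\zeta(Z_{t1}^o(\gamma_1), \gamma_{3n}).$$

Now $\gamma$ has three components: $\gamma = (\gamma_1, \gamma_2, \gamma_3)$, where $\gamma_3 = \{\gamma_{3n}\}$ is a
sequence of $\mathcal{F}$-measurable mappings, $\zeta$ is a known function such that
$\zeta(Z_{t1}^o(\gamma_1), \gamma_{3n})$ is a $p \times p$ matrix, and $Z_{t1}^o(\gamma_1)$ and $Z_{t2}^o(\gamma_2)$ are $\mathcal{F}_{t-1}$-measurable
for all $\gamma_1$, $\gamma_2$, where $Z_{t2}^o(\gamma_2)$ is an $l \times p$ random matrix.

By putting $Z_{t2}^o(\gamma_2) = X_t^o$, $Z_{t1}^o(\gamma_1) = W_t^o$, $\gamma_{3n} = \hat\alpha_n$, and $\zeta = (h)^{-1}$ we
obtain

$$Z_{nt}^o(\gamma) = X_t^o\,(h(W_t^o, \hat\alpha_n))^{-1} = X_t^o\hat\Omega_{nt}^{-1}.$$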


Many other choices that yield consistent asymptotically normal IV estima- 
tors are also permitted. 

The requirement that $\zeta$ be known is not restrictive; in particular, we can
choose $\zeta$ such that $\zeta(z_1, \gamma_{3n}) = \gamma_{3n}(z_1)$ (the evaluation functional), which
allows consideration of nonparametric estimators of the inverse covariance
matrix, $\Omega_t^{o-1}$, as well as parametric estimators such as ARCH (set $\gamma_{3n}(\cdot) = [h(\cdot, \hat\alpha_n)]^{-1}$).

To see what choices for $\gamma$ are permitted, it suffices to apply Theorem
4.54 with $Z_n^o(\gamma)$ interpreted as the $np \times l$ matrix with blocks $Z_{nt}^o(\gamma)'$. We
expand $\mathcal{P}$ to include data generating processes such that $\Omega_t^{o-1}$ can be
sufficiently well estimated by $\zeta(Z_{t1}^o(\gamma_1), \gamma_{3n})$ for some choice of $(\gamma_1, \gamma_3)$.
(What is meant by “sufficiently well” is that conditions (ii) and (iii)
of Theorem 4.54 hold.) It is here that the generality afforded by Theorem
4.54 (ii) is required, as we can no longer necessarily take
$m_n^o(\gamma) = n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t$. The presence of $\gamma_{3n}$ and the nature of the function
$\zeta$ may cause $n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t$ to fail to have a finite second moment,
even though $n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t - n^{1/2}m_n^o(\gamma) = o_{p^o}(1)$ for well behaved
$m_n^o(\gamma)$. Similarly, there is no guarantee that $E^o(Z_{nt}^o(\gamma)X_t^{o\prime})$ exists, although
the sample average $n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)X_t^{o\prime}$ may have a well defined
probability limit $Q_n^o(\gamma)$.

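Consider first the form of $m_n^o$. For notational economy, let $\mathcal{Y}_{t1}^o = Z_{t1}^o(\gamma_1)$
and $\mathcal{Y}_{t2}^o = Z_{t2}^o(\gamma_2)$. Then with

$$Z_{nt}^o(\gamma) = Z_{t2}^o(\gamma_2)\,\zeta(Z_{t1}^o(\gamma_1), \gamma_{3n}) = \mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_{3n})$$

we have

$$n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t = n^{-1/2}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_{3n})\varepsilon_t$$
$$= n^{-1/2}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)\varepsilon_t + n^{-1/2}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,[\zeta(\mathcal{Y}_{t1}^o, \gamma_{3n}) - \zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)]\varepsilon_t,$$

where we view $\gamma_3^o$ as the probability limit of $\gamma_{3n}$ under $P^o$. If sufficient
regularity is available to ensure that the second term vanishes in probability,
i.e.,

$$n^{-1/2}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,[\zeta(\mathcal{Y}_{t1}^o, \gamma_{3n}) - \zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)]\varepsilon_t = o_{p^o}(1),$$

then we can take

$$n^{1/2}m_n^o(\gamma) = n^{-1/2}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)\varepsilon_t.$$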


t=1 


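We ensure “sufficient regularity” simply by assuming that $\Gamma$ and $\mathcal{P}$ are
chosen so that the desired convergence holds for all $\gamma \in \Gamma$ and $P^o$ in $\mathcal{P}$.
The form of $Q_n^o$ follows similarly. With $Z_{nt}^o(\gamma) = \mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_{3n})$ we have

$$n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)X_t^{o\prime} = n^{-1}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_{3n})X_t^{o\prime}$$
$$= n^{-1}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)X_t^{o\prime} + n^{-1}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,[\zeta(\mathcal{Y}_{t1}^o, \gamma_{3n}) - \zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)]X_t^{o\prime}.$$

If $\Gamma$ and $\mathcal{P}$ are assumed chosen such that

$$n^{-1}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)X_t^{o\prime} - n^{-1}\sum_{t=1}^n E^o[\mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)X_t^{o\prime}] = o_{p^o}(1)$$

and

$$n^{-1}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,[\zeta(\mathcal{Y}_{t1}^o, \gamma_{3n}) - \zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)]X_t^{o\prime} = o_{p^o}(1),$$

then we can take

$$Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E^o[\mathcal{Y}_{t2}^o\,\zeta(\mathcal{Y}_{t1}^o, \gamma_3^o)X_t^{o\prime}].$$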
t=1 
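With $V_n^o(\gamma) = \mathrm{var}^o(n^{1/2}m_n^o(\gamma))$, the Bates-White correspondences are
now

$$H_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma),$$
$$s_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}m_n^o(\gamma),$$
$$I_n^o(\gamma) = H_n^o(\gamma).$$

(As before, we set $P_n^o = V_n^{o-1}$.) By argument parallel to that following
Proposition 4.55, the sufficient condition for efficiency is

$$Q_n^o(\gamma) = C_n^o(\gamma, \gamma^*)\,V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*),$$

where now $C_n^o(\gamma, \gamma^*) = \mathrm{cov}^o(n^{1/2}m_n^o(\gamma), n^{1/2}m_n^o(\gamma^*))$.
Following arguments exactly parallel to those leading to Equation (4.3),
we find that the sufficient condition holds if

$$Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}Z_{t2}^o(\gamma_2^*)\,\zeta(Z_{t1}^o(\gamma_1^*), \gamma_3^{*o})\,\Omega_t^o = X_t^o.$$

Choosing $Z_{t2}^o(\gamma_2^*) = X_t^o$ and $\zeta(Z_{t1}^o(\gamma_1^*), \gamma_3^{*o}) = \Omega_t^{o-1}$ can be verified to
ensure $Q_n^o(\gamma^*)' = V_n^o(\gamma^*)$ as before, so optimality follows. Thus, the optimal
choice for $Z_{nt}^o(\gamma)$ is

$$Z_{nt}^o(\gamma^*) = X_t^o\hat\Omega_{nt}^{-1},$$

where $\hat\Omega_{nt}^{-1} = \zeta(Z_{t1}^o(\gamma_1^*), \gamma_{3n}^*)$ is such that $\zeta(Z_{t1}^o(\gamma_1^*), \gamma_3^{*o}) = \Omega_t^{o-1}$. In particular,
if $\Omega_t^o = h(W_t^o, \alpha^o)$, then taking $\zeta(z_1, \gamma_{3n}) = \gamma_{3n}(z_1)$, $Z_{t1}^o(\gamma_1^*) = W_t^o$
and $\gamma_3^{*o} = [h(\cdot, \alpha^o)]^{-1}$ gives the optimal choice of instrumental variables.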

We formalize this result as follows. 


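Theorem 4.59 (Efficiency of FGLS) Suppose the conditions of Theorem
4.54 hold and that in addition $\mathcal{P}$ is such that for each $P^o$ in $\mathcal{P}$

(i) $Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta^o \in \mathbb{R}^k$,
where $\{(X_{t+1}^o, \varepsilon_t), \mathcal{F}_t\}$ is an adapted stochastic sequence with
$E^o(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0$, $t = 1, 2, \ldots$; and

(ii) $E^o(\varepsilon_t\varepsilon_t' \mid \mathcal{F}_{t-1}) = \Omega_t^o$ is an unknown finite and nonsingular $p \times p$ matrix,
$t = 1, 2, \ldots$;

(iii) For each $\gamma \in \Gamma$,

$$Z_{nt}^o(\gamma) = Z_{t2}^o(\gamma_2)\,\gamma_{3n}(Z_{t1}^o(\gamma_1)),$$

where $\gamma = (\gamma_1, \gamma_2, \gamma_3)$, and $\gamma_3 = \{\gamma_{3n}\}$ is such that $\gamma_{3n}$ is $\mathcal{F}$-measurable
for all $n = 1, 2, \ldots$, and $Z_{t1}^o(\gamma_1)$ and $Z_{t2}^o(\gamma_2)$ are $\mathcal{F}_{t-1}$-measurable
for all $\gamma_1$ and $\gamma_2$, $t = 1, 2, \ldots$; and there exists $\gamma_3^o$ such
that

$$m_n^o(\gamma) = n^{-1}\sum_{t=1}^n Z_{t2}^o(\gamma_2)\,\gamma_3^o(Z_{t1}^o(\gamma_1))\,\varepsilon_t$$

and

$$Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E^o\big(Z_{t2}^o(\gamma_2)\,\gamma_3^o(Z_{t1}^o(\gamma_1))\,X_t^{o\prime}\big);$$

(iv) for some $\gamma_1^*$, $\gamma_3^*$, $\hat\Omega_{nt}^{-1} = \gamma_{3n}^*(Z_{t1}^o(\gamma_1^*))$, where $\gamma_3^{*o}(Z_{t1}^o(\gamma_1^*)) = \Omega_t^{o-1}$.

Then

(a) $Z_{nt}^o(\gamma^*) = X_t^o\hat\Omega_{nt}^{-1}$ delivers

$$\beta_n^* = \Big(n^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big)^{-1} n^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}Y_t^o,$$

which is asymptotically efficient in $\mathcal{E}$ for $\mathcal{P}$;

(b) the asymptotic covariance matrix of $\beta_n^*$ is given by

$$\mathrm{avar}^o(\beta_n^*) = \Big[n^{-1}\sum_{t=1}^n E^o(X_t^o\Omega_t^{o-1}X_t^{o\prime})\Big]^{-1},$$

which is consistently estimated by

$$\hat D_n = \Big[n^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big]^{-1},$$

i.e., $\hat D_n - \mathrm{avar}^o(\beta_n^*) \xrightarrow{p^o} 0$.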


Proof. (a) Apply Proposition 4.55 and the argument preceding the statement
of Theorem 4.59. We leave the proof of part (b) as an exercise. ∎

Exercise 4.60 Prove part (b) of Theorem 4.59.


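The arguments needed to verify that

$$n^{-1/2}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,[\gamma_{3n}(\mathcal{Y}_{t1}^o) - \gamma_3^o(\mathcal{Y}_{t1}^o)]\,\varepsilon_t = o_{p^o}(1)$$

and

$$n^{-1}\sum_{t=1}^n \mathcal{Y}_{t2}^o\,[\gamma_{3n}(\mathcal{Y}_{t1}^o) - \gamma_3^o(\mathcal{Y}_{t1}^o)]\,X_t^{o\prime} = o_{p^o}(1)$$

depend on what is known about $\Omega_t^o$ and how it is estimated. The next
exercise illustrates one possibility.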


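Exercise 4.61 Suppose that conditions 4.59 (i) and 4.59 (ii) hold with
$p = 1$ and that

$$\Omega_t^o = \alpha_1^o + \alpha_2^o W_t^o, \qquad t = 1, 2, \ldots,$$

where $W_t^o = 1$ during recessions and $W_t^o = 0$ otherwise.

(i) Let $\hat\varepsilon_t = Y_t^o - X_t^{o\prime}\hat\beta_n$, where $\hat\beta_n$ is the OLS estimator, and let
$\hat\alpha_n = (\hat\alpha_{1n}, \hat\alpha_{2n})'$ be the OLS estimator from the regression of $\hat\varepsilon_t^2$ on a constant
and $W_t^o$. Prove that $\hat\alpha_n = \alpha^o + o_{p^o}(1)$, where $\alpha^o = (\alpha_1^o, \alpha_2^o)'$,
providing any needed additional conditions.

(ii) Next, verify that conditions 4.59 (iii) and (iv) hold for

$$Z_{nt}^o(\gamma^*) = X_t^o\,(\hat\alpha_{1n} + \hat\alpha_{2n}W_t^o)^{-1}.$$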
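
The three steps implicit in Exercise 4.61 (OLS, an auxiliary regression of squared residuals on a constant and $W_t^o$, then FGLS) can be sketched numerically. This is only an illustration on simulated data: the sample size, the parameter values, the regressor design, and all variable names are assumptions made for the example, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
beta = np.array([1.0, -2.0])              # hypothetical beta^o
a1, a2 = 1.0, 3.0                         # hypothetical alpha_1^o, alpha_2^o

W = (rng.random(n) < 0.3).astype(float)   # recession indicator W_t
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # rows are X_t'
omega = a1 + a2 * W                       # Omega_t = alpha_1 + alpha_2 W_t
Y = X @ beta + np.sqrt(omega) * rng.normal(size=n)

# Step 1: OLS and squared residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
e2 = (Y - X @ b_ols) ** 2

# Step 2: regress squared residuals on a constant and W_t
G = np.column_stack([np.ones(n), W])
a_hat = np.linalg.solve(G.T @ G, G.T @ e2)
omega_hat = G @ a_hat                     # fitted Omega_hat_nt

# Step 3: FGLS, i.e. IV with instruments X_t * Omega_hat_nt^{-1}
Xw = X / omega_hat[:, None]
b_fgls = np.linalg.solve(Xw.T @ X, Xw.T @ Y)
D_hat = np.linalg.inv(Xw.T @ X / n)       # estimate of avar(beta_n^*)
```

Because the auxiliary regressors are a constant and a dummy, the fitted values `omega_hat` are group means of squared residuals and so are positive whenever both regimes occur in the sample.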


When less is known about the structure of $\Omega_t^o$, nonparametric estimators
can be used. See, for example, White and Stinchcombe (1991). The notion 
of stochastic equicontinuity (see Andrews, 1994a,b) plays a key role in such 
situations, but will not be explored here. 

We are now in a position to see how to treat the general situation in
which $\tilde X_t^o = E^o(X_t^o \mid \mathcal{F}_{t-1})$ appears instead of $X_t^o$. By taking

$$Z_{nt}^o(\gamma) = \gamma_{4n}(Z_{t2}^o(\gamma_2))\,\gamma_{3n}(Z_{t1}^o(\gamma_1))$$

we can accommodate the choice

$$Z_{nt}^o(\gamma) = \hat X_{nt}\hat\Omega_{nt}^{-1},$$

where $\hat X_{nt} = \gamma_{4n}(Z_{t2}^o(\gamma_2)) = \gamma_{4n}(\mathcal{Y}_{t2}^o)$ is an estimator of $\tilde X_t^o$. As we might
now expect, this choice delivers the optimal IV estimator.
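To see this, we examine $m_n^o$ and $Q_n^o$. We have

$$n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t = n^{-1/2}\sum_{t=1}^n \gamma_{4n}(\mathcal{Y}_{t2}^o)\,\gamma_{3n}(\mathcal{Y}_{t1}^o)\,\varepsilon_t$$
$$= n^{-1/2}\sum_{t=1}^n \gamma_4^o(\mathcal{Y}_{t2}^o)\,\gamma_3^o(\mathcal{Y}_{t1}^o)\,\varepsilon_t + n^{-1/2}\sum_{t=1}^n [\gamma_{4n}(\mathcal{Y}_{t2}^o)\gamma_{3n}(\mathcal{Y}_{t1}^o) - \gamma_4^o(\mathcal{Y}_{t2}^o)\gamma_3^o(\mathcal{Y}_{t1}^o)]\,\varepsilon_t$$
$$= n^{1/2}m_n^o(\gamma) + o_{p^o}(1),$$

provided we set $m_n^o(\gamma) = n^{-1}\sum_{t=1}^n \gamma_4^o(\mathcal{Y}_{t2}^o)\,\gamma_3^o(\mathcal{Y}_{t1}^o)\,\varepsilon_t$ and

$$n^{-1/2}\sum_{t=1}^n [\gamma_{4n}(\mathcal{Y}_{t2}^o)\gamma_{3n}(\mathcal{Y}_{t1}^o) - \gamma_4^o(\mathcal{Y}_{t2}^o)\gamma_3^o(\mathcal{Y}_{t1}^o)]\,\varepsilon_t = o_{p^o}(1).$$

Similarly,

$$n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)X_t^{o\prime} = n^{-1}\sum_{t=1}^n \gamma_{4n}(\mathcal{Y}_{t2}^o)\,\gamma_{3n}(\mathcal{Y}_{t1}^o)\,X_t^{o\prime}$$
$$= n^{-1}\sum_{t=1}^n \gamma_4^o(\mathcal{Y}_{t2}^o)\,\gamma_3^o(\mathcal{Y}_{t1}^o)\,X_t^{o\prime} + n^{-1}\sum_{t=1}^n [\gamma_{4n}(\mathcal{Y}_{t2}^o)\gamma_{3n}(\mathcal{Y}_{t1}^o) - \gamma_4^o(\mathcal{Y}_{t2}^o)\gamma_3^o(\mathcal{Y}_{t1}^o)]\,X_t^{o\prime}.$$

Provided the second term vanishes in probability ($-P^o$) and a law of large
numbers applies to the first term, we can take

$$Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E^o\big(\gamma_4^o(\mathcal{Y}_{t2}^o)\,\gamma_3^o(\mathcal{Y}_{t1}^o)\,X_t^{o\prime}\big).$$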
t=1 


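With $V_n^o(\gamma) = \mathrm{var}^o(n^{1/2}m_n^o(\gamma))$, the Bates-White correspondences are
again

$$H_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma),$$
$$s_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}m_n^o(\gamma),$$
$$I_n^o(\gamma) = H_n^o(\gamma),$$

and the sufficient condition for asymptotic efficiency is again

$$Q_n^o(\gamma) = C_n^o(\gamma, \gamma^*)\,V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*).$$

Arguments parallel to those leading to Equation (4.3) show that the
sufficient condition holds if

$$Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}\gamma_4^{*o}(Z_{t2}^o(\gamma_2^*))\,\gamma_3^{*o}(Z_{t1}^o(\gamma_1^*))\,\Omega_t^o = \tilde X_t^o.$$

It suffices that $\gamma_3^{*o}(Z_{t1}^o(\gamma_1^*)) = \Omega_t^{o-1}$ and $\gamma_4^{*o}(Z_{t2}^o(\gamma_2^*)) = \tilde X_t^o$, as this
ensures $Q_n^o(\gamma^*)' = V_n^o(\gamma^*)$.
We can now state our asymptotic efficiency result for the IV estimator.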
We can now state our asymptotic efficiency result for the IV estimator. 


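Theorem 4.62 (Efficient IV) Suppose the conditions of Theorem 4.54
hold and that in addition $\mathcal{P}$ is such that for each $P^o$ in $\mathcal{P}$

(i) $Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta^o \in \mathbb{R}^k$,
where $\{\varepsilon_t, \mathcal{F}_t\}$ is an adapted stochastic sequence with $E^o(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0$,
$t = 1, 2, \ldots$; and

(ii) $E^o(\varepsilon_t\varepsilon_t' \mid \mathcal{F}_{t-1}) = \Omega_t^o$ is an unknown finite and nonsingular $p \times p$ matrix,
$t = 1, 2, \ldots$;

(iii) For each $\gamma \in \Gamma$,

$$Z_{nt}^o(\gamma) = \gamma_{4n}(Z_{t2}^o(\gamma_2))\,\gamma_{3n}(Z_{t1}^o(\gamma_1)),$$

where $\gamma = (\gamma_1, \gamma_2, \gamma_3, \gamma_4)$, and $\gamma_3 = \{\gamma_{3n}\}$ and $\gamma_4 = \{\gamma_{4n}\}$ are such
that $\gamma_{3n}$ and $\gamma_{4n}$ are $\mathcal{F}$-measurable for all $n = 1, 2, \ldots$ and $Z_{t1}^o(\gamma_1)$
and $Z_{t2}^o(\gamma_2)$ are $\mathcal{F}_{t-1}$-measurable for all $\gamma_1$ and $\gamma_2$, $t = 1, 2, \ldots$; and
there exist $\gamma_3^o$ and $\gamma_4^o$ such that

$$m_n^o(\gamma) = n^{-1}\sum_{t=1}^n \gamma_4^o(Z_{t2}^o(\gamma_2))\,\gamma_3^o(Z_{t1}^o(\gamma_1))\,\varepsilon_t$$

and

$$Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E^o\big(\gamma_4^o(Z_{t2}^o(\gamma_2))\,\gamma_3^o(Z_{t1}^o(\gamma_1))\,X_t^{o\prime}\big);$$

(iv) for some $\gamma_1^*$ and $\gamma_3^*$, $\hat\Omega_{nt}^{-1} = \gamma_{3n}^*(Z_{t1}^o(\gamma_1^*))$, where $\gamma_3^{*o}(Z_{t1}^o(\gamma_1^*)) = \Omega_t^{o-1}$;
and for some $\gamma_2^*$ and $\gamma_4^*$, $\hat X_{nt} = \gamma_{4n}^*(Z_{t2}^o(\gamma_2^*))$, where $\gamma_4^{*o}(Z_{t2}^o(\gamma_2^*)) = \tilde X_t^o$.

Then

(a) $Z_{nt}^o(\gamma^*) = \hat X_{nt}\hat\Omega_{nt}^{-1}$ delivers

$$\beta_n^* = \Big[n^{-1}\sum_{t=1}^n \hat X_{nt}\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big]^{-1} n^{-1}\sum_{t=1}^n \hat X_{nt}\hat\Omega_{nt}^{-1}Y_t^o,$$

which is asymptotically efficient in $\mathcal{E}$ for $\mathcal{P}$;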
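(b) the asymptotic covariance matrix of $\beta_n^*$ is given by

$$\mathrm{avar}^o(\beta_n^*) = \Big[n^{-1}\sum_{t=1}^n E^o(\tilde X_t^o\Omega_t^{o-1}\tilde X_t^{o\prime})\Big]^{-1},$$

which is consistently estimated by

$$\hat D_n = \Big[\tfrac{1}{2}\Big(n^{-1}\sum_{t=1}^n \hat X_{nt}\hat\Omega_{nt}^{-1}X_t^{o\prime} + n^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}\hat X_{nt}'\Big)\Big]^{-1},$$

i.e., $\hat D_n - \mathrm{avar}^o(\beta_n^*) \xrightarrow{p^o} 0$.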


Proof. (a) Apply Proposition 4.55 and the argument preceding the statement
of Theorem 4.62. We leave the proof of part (b) as an exercise. ∎


Exercise 4.63 Prove Theorem 4.62 (b). 


Other consistent estimators for $\mathrm{avar}^o(\beta_n^*)$ are available, for example

$$\tilde D_n = \Big[n^{-1}\sum_{t=1}^n \hat X_{nt}\hat\Omega_{nt}^{-1}\hat X_{nt}'\Big]^{-1}.$$

This requires additional conditions, however. The estimator $\hat D_n$ above is
consistent without further assumptions.


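Exercise 4.64 Suppose condition 4.62 (i) and (ii) hold and that
$\Omega_t^o = \Sigma^o$ and
$E^o(X_t^o \mid \mathcal{F}_{t-1}) = \pi^{o\prime}Z_t^o$, $t = 1, 2, \ldots$,
where $\pi^o$ is an $l \times k$ matrix.

(i) (a) Let $\hat\pi_n = \big(n^{-1}\sum_{t=1}^n Z_t^oZ_t^{o\prime}\big)^{-1} n^{-1}\sum_{t=1}^n Z_t^oX_t^{o\prime}$.
Give simple conditions under which $\hat\pi_n = \pi^o + o_{p^o}(1)$.

(b) Let $\hat\Sigma_n = n^{-1}\sum_{t=1}^n \hat\varepsilon_t\hat\varepsilon_t'$, where $\hat\varepsilon_t = Y_t^o - X_t^{o\prime}\tilde\beta_n$,
$$\tilde\beta_n = (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$$
(the 2SLS estimator). Give simple conditions under which $\hat\Sigma_n = \Sigma^o + o_{p^o}(1)$.

(ii) Next, verify that conditions 4.62 (iii) and 4.62 (iv) hold for
$Z_{nt}^o(\gamma^*) = \hat\pi_n'Z_t^o\hat\Sigma_n^{-1}$. The resulting estimator is the three stage least squares
(3SLS) estimator.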
CHAPTER 5


Central Limit Theory 


In this chapter we study different versions of the central limit theorem that 
provide conditions guaranteeing the asymptotic normality of $n^{-1/2}X'\varepsilon$ or
$n^{-1/2}Z'\varepsilon$ required for the results of the previous chapter. As with laws
of large numbers, different conditions will apply to different kinds of eco- 
nomic data. Central limit results are generally available for each of the 
situations considered in Chapter 3, and we shall pay particular attention 
to the parallels involved. 
The central limit theorems we consider are all of the following form. 


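Proposition 5.0 Given restrictions on the moments, dependence, and heterogeneity
of a scalar sequence $\{Z_t\}$, $(\bar Z_n - \bar\mu_n)/(\bar\sigma_n/\sqrt{n}) = \sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n \stackrel{A}{\sim} N(0, 1)$,
where $\bar\mu_n = E(\bar Z_n)$ and $\bar\sigma_n^2/n = \mathrm{var}(\bar Z_n)$.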


In other words, under general conditions the sample average of a sequence 
has a limiting unit normal distribution when appropriately standardized. 
The results that follow specify precisely the restrictions that are sufficient 
to imply asymptotic normality. As with the laws of large numbers, there 
are natural trade-offs among these restrictions. Typically, greater depen- 
dence or heterogeneity is allowed at the expense of imposing more stringent 
moment requirements. 

Although the results of the preceding chapter imposed the asymptotic 
normality requirement on the joint distribution of vectors such as $n^{-1/2}X'\varepsilon$
or $n^{-1/2}Z'\varepsilon$, it is actually only necessary to study central limit theory for
sequences of scalars. This simplicity is a consequence of the following result. 
Proposition 5.1 (Cramér-Wold device) Let $\{b_n\}$ be a sequence of random
$k \times 1$ vectors and suppose that for any real $k \times 1$ vector $\lambda$ such that
$\lambda'\lambda = 1$, $\lambda'b_n \stackrel{A}{\sim} \lambda'Z$, where $Z$ is a $k \times 1$ vector with joint distribution
function $F$. Then the limiting distribution function of $b_n$ exists and equals
$F$.

Proof. See Rao (1973, p. 123). ∎

We shall apply this result by showing that under general conditions,

$$n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}X_t\varepsilon_t \stackrel{A}{\sim} \lambda'Z$$

or

$$n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}Z_t\varepsilon_t \stackrel{A}{\sim} \lambda'Z,$$

where $Z \sim N(0, I)$, which, by Proposition 5.1, allows us to obtain the
desired conclusion, i.e.,

$$V_n^{-1/2}n^{-1/2}X'\varepsilon \stackrel{A}{\sim} N(0, I) \quad \text{or} \quad V_n^{-1/2}n^{-1/2}Z'\varepsilon \stackrel{A}{\sim} N(0, I).$$

When used in this context below, the vector $\lambda$ will always be understood
to have unit norm, i.e., $\lambda'\lambda = 1$.


5.1 Independent Identically Distributed 
Observations 


As with laws of large numbers, the case of independent identically dis- 
tributed observations is the simplest. 


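Theorem 5.2 (Lindeberg-Lévy) Let $\{Z_t\}$ be a sequence of i.i.d. random
scalars, with $\mu = E(Z_t)$ and $\sigma^2 = \mathrm{var}(Z_t) < \infty$. If $\sigma^2 \neq 0$, then

$$\sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n = \sqrt{n}(\bar Z_n - \mu)/\sigma = n^{-1/2}\sum_{t=1}^n (Z_t - \mu)/\sigma \stackrel{A}{\sim} N(0, 1).$$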


Proof. Let $f(\lambda)$ be the characteristic function of $Z_t - \mu$ and let $f_n(\lambda)$ be
the characteristic function of $\sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n = n^{-1/2}\sum_{t=1}^n (Z_t - \mu)/\sigma$.
From Propositions 4.13 and 4.14 we have

$$f_n(\lambda) = f(\lambda/(\sigma\sqrt{n}))^n$$

or

$$\log f_n(\lambda) = n\log f(\lambda/(\sigma\sqrt{n})).$$

Taking a Taylor expansion of $f(\lambda)$ around $\lambda = 0$ gives $f(\lambda) = 1 - \sigma^2\lambda^2/2 + o(\lambda^2)$
since $\sigma^2 < \infty$, by Proposition 4.15. Hence

$$\log f_n(\lambda) = n\log[1 - \lambda^2/(2n) + o(\lambda^2/n)] \to -\lambda^2/2$$

as $n \to \infty$. Hence $f_n(\lambda) \to \exp(-\lambda^2/2)$. Since this is continuous at zero, it
follows from the continuity theorem (Theorem 4.17), the uniqueness theorem
(Theorem 4.11), and Example 4.10 (i) that $\sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n \stackrel{A}{\sim} N(0, 1)$.
∎


Compared with the law of large numbers for i.i.d. observations, we impose
a single additional requirement, i.e., that $\sigma^2 = \mathrm{var}(Z_t) < \infty$. Note that this
implies $E|Z_t| < \infty$. (Why?) Also note that without loss of generality, we
can set $E(Z_t) = 0$.
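
A small Monte Carlo experiment can illustrate the Lindeberg-Lévy theorem. The sketch below draws i.i.d. exponential observations (a skewed distribution chosen purely for illustration) and checks that the standardized sample mean behaves approximately like a unit normal. The distribution, sample size, number of replications, and variable names are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 400, 10000
mu, sigma = 1.0, 1.0                       # exponential(1): mean 1, std. dev. 1

Z = rng.exponential(scale=1.0, size=(reps, n))
# sqrt(n) (Zbar_n - mu) / sigma, one value per replication
stat = np.sqrt(n) * (Z.mean(axis=1) - mu) / sigma

# share of replications inside the nominal 95% unit-normal band
cover = np.mean(np.abs(stat) <= 1.96)
```

With these settings the simulated mean and variance of `stat` should be close to 0 and 1, and `cover` close to 0.95, although the exact figures depend on the random seed.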

We can apply Theorem 5.2 to give conditions that ensure that the con- 
ditions of Theorem 4.25 and Exercise 4.26 are satisfied. 


Theorem 5.3 Given
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(X_t', \varepsilon_t)\}$ is an i.i.d. sequence;
(iii) (a) $E(X_t\varepsilon_t) = 0$;
(b) $E|X_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$;
(c) $V_n = \mathrm{var}(n^{-1/2}X'\varepsilon) = V$ is positive definite;
(iv) (a) $E|X_{thi}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$;
(b) $M = E(X_tX_t')$ is positive definite.
Then $D^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where $D = M^{-1}VM^{-1}$. Suppose
in addition that
(v) there exists $\hat V_n$ symmetric and positive semidefinite such that $\hat V_n - V \xrightarrow{p} 0$.
Then $\hat D_n - D \xrightarrow{p} 0$, where $\hat D_n = (X'X/n)^{-1}\hat V_n(X'X/n)^{-1}$.

Proof. We verify the conditions of Theorem 4.25. We apply Theorem
5.2 and set $Z_t = \lambda'V^{-1/2}X_t\varepsilon_t$. The summands $\lambda'V^{-1/2}X_t\varepsilon_t$ are i.i.d.
given (ii), with $E(Z_t) = 0$ given (iii.a), and $\mathrm{var}(Z_t) = 1$ given (iii.b)
and (iii.c). Hence $n^{-1/2}\sum_{t=1}^n Z_t = n^{-1/2}\sum_{t=1}^n \lambda'V^{-1/2}X_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$
by the Lindeberg-Lévy Theorem 5.2. It follows from Proposition 5.1 that
$V^{-1/2}n^{-1/2}X'\varepsilon \stackrel{A}{\sim} N(0, I)$, where $V$ is $O(1)$ given (iii.b) and positive definite
given (iii.c). It follows from Kolmogorov’s strong law of large numbers,
Theorem 3.1, and from Theorem 2.24 that $X'X/n - M \xrightarrow{p} 0$ given (ii)
and (iv). Since the rest of the conditions of Theorem 4.25 are satisfied by
assumption, the result follows. ∎


In many cases $V$ may simplify. For example, it may be known that
$E(\varepsilon_t^2 \mid X_t) = \sigma_o^2$ ($p = 1$). If so,

$$V = E(X_t\varepsilon_t\varepsilon_t'X_t') = E(\varepsilon_t^2X_tX_t') = E(E(\varepsilon_t^2X_tX_t' \mid X_t))$$
$$= E(E(\varepsilon_t^2 \mid X_t)X_tX_t') = \sigma_o^2E(X_tX_t') = \sigma_o^2M.$$

The obvious estimator for $V$ is then $\hat V_n = \hat\sigma_n^2(X'X/n)$, where $\hat\sigma_n^2$ is consistent
for $\sigma_o^2$. A similar result holds for systems of equations in which it
is known that $E(\varepsilon_t\varepsilon_t' \mid X_t) = I$ (after suitable transformation of an underlying
DGP). Then $V = M$ and a consistent estimator is $\hat V_n = (X'X/n)$.
Consistency results for more general cases are studied in the next chapter.
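
For the homoskedastic scalar case just described, the covariance estimator of Theorem 5.3 collapses to $\hat\sigma_n^2(X'X/n)^{-1}$. The following sketch checks this numerically on simulated data; the data generating process, the choice of $\hat\sigma_n^2$ as the average squared OLS residual, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma0 = 4000, 1.5                      # hypothetical sample size and sigma_o
beta0 = np.array([0.5, 2.0])               # hypothetical beta_o
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # rows are X_t'
Y = X @ beta0 + sigma0 * rng.normal(size=n)

M_hat = X.T @ X / n                        # X'X / n
b = np.linalg.solve(X.T @ X, X.T @ Y)      # OLS estimator
resid = Y - X @ b
s2 = resid @ resid / n                     # one consistent estimator of sigma_o^2
V_hat = s2 * M_hat                         # V_hat_n = sigma_hat^2 (X'X/n)

Minv = np.linalg.inv(M_hat)
D_hat = Minv @ V_hat @ Minv                # (X'X/n)^{-1} V_hat_n (X'X/n)^{-1}
se = np.sqrt(np.diag(D_hat) / n)           # implied standard errors for b
```

Algebraically `D_hat` equals `s2 * Minv`, which is the familiar OLS covariance formula scaled by $n$.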
In comparison with the consistency result for the OLS estimator, we 
have obtained the asymptotic normality result by imposing the additional 
second moment conditions of (iii.b) and (iii.c). Otherwise, the conditions 
are identical. A similar result holds for the IV estimator. 
Exercise 5.4 Prove the following result. Given
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is an i.i.d. sequence;
(iii) (a) $E(Z_t\varepsilon_t) = 0$;
(b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon) = V$ is positive definite;
(iv) (a) $E|Z_{thi}X_{thj}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, and $j = 1, \ldots, k$;
(b) $Q = E(Z_tX_t')$ has full column rank;
(c) $\hat P_n \xrightarrow{p} P$, finite, symmetric, and positive definite.

Then $D^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

$$D = (Q'PQ)^{-1}Q'PVPQ(Q'PQ)^{-1}.$$

Suppose further that

(v) there exists $\hat V_n$ symmetric and positive semidefinite such that $\hat V_n - V \xrightarrow{p} 0$.

Then $\hat D_n - D \xrightarrow{p} 0$, where

$$\hat D_n = (X'Z\hat P_nZ'X/n^2)^{-1}(X'Z/n)\hat P_n\hat V_n\hat P_n(Z'X/n)(X'Z\hat P_nZ'X/n^2)^{-1}.$$

Exercise 5.5 If $p = 1$ and $E(\varepsilon_t^2 \mid Z_t) = \sigma_o^2$, what is the efficient IV estimator?
What is the natural estimator for $V$? What additional conditions
ensure that $\hat P_n - P \xrightarrow{p} 0$ and $\hat V_n - V \xrightarrow{p} 0$?


These results apply to observations from a random sample. However, they
do not apply to situations such as the standard regression model with fixed
regressors, or to stratified cross sections, because in these situations the
elements of the sum $n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t$ are no longer identically distributed.
For example, with $X_t$ fixed and $E(\varepsilon_t^2) = \sigma_o^2$, $\mathrm{var}(X_t\varepsilon_t) = \sigma_o^2X_tX_t'$, which
depends on $X_tX_t'$ and hence differs from observation to observation. For
these cases we need to relax the identical distribution assumption.


5.2 Independent Heterogeneously Distributed 
Observations 


Several different central limit theorems are available for the case in which 
our observations are not identically distributed. The most general result is 
in fact the centerpiece of all asymptotic distribution theory. 


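Theorem 5.6 (Lindeberg-Feller) Let $\{Z_t\}$ be a sequence of independent
random scalars with $\mu_t = E(Z_t)$, $\sigma_t^2 = \mathrm{var}(Z_t) < \infty$, $\sigma_t^2 \neq 0$ and
distribution functions $F_t$, $t = 1, 2, \ldots$. Then

$$\sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n \stackrel{A}{\sim} N(0, 1)$$

and

$$\lim_{n\to\infty}\max_{1 \le t \le n} n^{-1}(\sigma_t^2/\bar\sigma_n^2) = 0$$

if and only if for every $\varepsilon > 0$,

$$\lim_{n\to\infty}\bar\sigma_n^{-2}n^{-1}\sum_{t=1}^n \int_{(z - \mu_t)^2 > \varepsilon n\bar\sigma_n^2} (z - \mu_t)^2\,dF_t(z) = 0.$$

Proof. See Loeve (1977, pp. 292-294). ∎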


The last condition of this result is called the Lindeberg condition. It
essentially requires the average contribution of the extreme tails to the
variance of $Z_t$ to be zero in the limit. When the Lindeberg condition holds,
not only does asymptotic normality follow, but the uniform asymptotic
negligibility condition $\max_{1 \le t \le n} n^{-1}(\sigma_t^2/\bar\sigma_n^2) \to 0$ as $n \to \infty$ also holds.
This condition says that none of the $Z_t$ has a variance so great that it
dominates the variance of $\bar Z_n$. Further, the case $\sigma_t^2 = 0$ for all $t$ is ruled
out by the Lindeberg condition. Thus $\max_{1 \le t \le n}\sigma_t^2 > 0$, which by the Lindeberg
conditions implies that $n\bar\sigma_n^2 \to \infty$, so $n\bar\sigma_n^2 = \sum_{t=1}^n \sigma_t^2$ is prevented
from converging to some finite value. Together, asymptotic normality and
uniform asymptotic negligibility imply the Lindeberg condition.


Example 5.7 Let $\sigma_t^2 = \rho^t$, $0 < \rho < 1$. Then $n\bar\sigma_n^2 = \sum_{t=1}^n \rho^t \to \rho/(1 - \rho)$
as $n \to \infty$, and

$$\max_{1 \le t \le n} n^{-1}(\sigma_t^2/\bar\sigma_n^2) \to \rho/[\rho/(1 - \rho)] = 1 - \rho \neq 0.$$

Hence $\{Z_t\}$ is not uniformly asymptotically negligible. It follows that the
Lindeberg condition is not satisfied, so asymptotic normality may or may
not hold for such a sequence.
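
The limit in Example 5.7 is easy to confirm numerically. The sketch below evaluates $\max_{1 \le t \le n} n^{-1}(\sigma_t^2/\bar\sigma_n^2) = \max_t \sigma_t^2/\sum_{s=1}^n \sigma_s^2$ for geometric variances; the helper name `uan_ratio` and the value $\rho = 0.5$ are illustrative assumptions.

```python
import numpy as np

def uan_ratio(rho, n):
    """max_t n^{-1} sigma_t^2 / sigma_bar_n^2 when sigma_t^2 = rho**t."""
    var_t = rho ** np.arange(1, n + 1)
    # n * sigma_bar_n^2 equals the sum of the variances
    return var_t.max() / var_t.sum()

rho = 0.5
ratios = [uan_ratio(rho, n) for n in (10, 100, 1000)]
```

The ratios settle at $1 - \rho$ rather than shrinking to zero, in contrast with the i.i.d. case, where the same quantity equals $1/n$.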


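Example 5.8 Let $\{Z_t\}$ be i.i.d. with $\sigma^2 = \mathrm{var}(Z_t) < \infty$. By Theorem 5.2,
$\sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n \stackrel{A}{\sim} N(0, 1)$. Further, $\bar\sigma_n^2 = \sigma^2$, so

$$\max_{1 \le t \le n} n^{-1}(\sigma_t^2/\bar\sigma_n^2) = n^{-1}(\sigma^2/\sigma^2) \to 0.$$

It follows that the Lindeberg condition is satisfied.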


Exercise 5.9 Give a direct demonstration that the Lindeberg condition is
satisfied for identically distributed $\{Z_t\}$ with $\sigma^2 = \mathrm{var}(Z_t) < \infty$, so that
Theorem 5.2 follows as a corollary to Theorem 5.6. Hint: apply the Monotone
Convergence Theorem (Rao, 1978, p. 135).


In general, the Lindeberg condition can be somewhat difficult to verify, 
so it is convenient to have a simpler condition that implies the Lindeberg 
condition. This is provided by the following result. 


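Theorem 5.10 (Liapounov)¹ Let $\{Z_t\}$ be a sequence of independent random
scalars with $\mu_t = E(Z_t)$, $\sigma_t^2 = \mathrm{var}(Z_t)$, and $E|Z_t - \mu_t|^{2+\delta} < \Delta < \infty$
for some $\delta > 0$ and all $t$. If $\bar\sigma_n^2 > \delta' > 0$ for all $n$ sufficiently large, then

$$\sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n \stackrel{A}{\sim} N(0, 1).$$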


1 As stated, this result is actually a corollary to Liapounov’s original theorem. See 
Loeve (1977, p. 287). 




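Proof. We verify that the Lindeberg condition is satisfied. Define $A = \{z :
(z - \mu_t)^2 > \varepsilon n\bar\sigma_n^2\}$; then

$$\int_A (z - \mu_t)^2\,dF_t(z) = \int_A |z - \mu_t|^{-\delta}\,|z - \mu_t|^{2+\delta}\,dF_t(z).$$

Whenever $(z - \mu_t)^2 > \varepsilon n\bar\sigma_n^2$, it follows that $|z - \mu_t|^{-\delta} < (\varepsilon n\bar\sigma_n^2)^{-\delta/2}$, so

$$\int_A (z - \mu_t)^2\,dF_t(z) \le (\varepsilon n\bar\sigma_n^2)^{-\delta/2}\int_A |z - \mu_t|^{2+\delta}\,dF_t(z)$$
$$\le (\varepsilon n\bar\sigma_n^2)^{-\delta/2}E|Z_t - \mu_t|^{2+\delta}$$
$$< (\varepsilon n\bar\sigma_n^2)^{-\delta/2}\Delta.$$

Hence for any $\varepsilon > 0$,

$$\bar\sigma_n^{-2}n^{-1}\sum_{t=1}^n \int_A (z - \mu_t)^2\,dF_t(z) < \bar\sigma_n^{-2}(\varepsilon n\bar\sigma_n^2)^{-\delta/2}\Delta = n^{-\delta/2}\bar\sigma_n^{-2-\delta}\varepsilon^{-\delta/2}\Delta.$$

Since $\bar\sigma_n^2 > \delta'$, $\bar\sigma_n^{-2-\delta} < (\delta')^{-1-\delta/2}$ for all $n$ sufficiently large. It follows that

$$\bar\sigma_n^{-2}n^{-1}\sum_{t=1}^n \int_A (z - \mu_t)^2\,dF_t(z) < n^{-\delta/2}(\delta')^{-1-\delta/2}\varepsilon^{-\delta/2}\Delta \to 0 \text{ as } n \to \infty.$$
∎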


This result allows us to substitute the requirement that some moment of
order slightly greater than two is uniformly bounded in place of the more
complicated Lindeberg condition. Note that $E|Z_t|^{2+\delta} < \Delta$ also implies that
$E|Z_t - \mu_t|^{2+\delta}$ is uniformly bounded. Note also the analogy with Corollary
3.9. There we obtained a law of large numbers for independent random
variables by imposing a uniform bound on $E|Z_t|^{1+\delta}$. Now we can obtain a
central limit theorem imposing a uniform bound on $E|Z_t|^{2+\delta}$.

We seek an asymptotic normality result analogous to Theorem 5.3 for
independent heterogeneous random variables. If we apply Theorem 5.10
instead of Theorem 5.2, we run into a small difficulty. Recall that we applied
the Cramér-Wold device to the sums $n^{-1/2}\sum_{t=1}^n \lambda'V^{-1/2}X_t\varepsilon_t$, where $V = \mathrm{var}(n^{-1/2}X'\varepsilon)$.
In the present case the random variables $X_t\varepsilon_t$ are no longer
identically distributed, and there is now no reason to suppose that
$V_n = \mathrm{var}(n^{-1/2}X'\varepsilon)$ is a constant or has a constant limit, in general. By analogy, we would
like to apply the Cramér-Wold device to $n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}X_t\varepsilon_t$. But
the summands $\lambda'V_n^{-1/2}X_t\varepsilon_t$ now depend explicitly on $n$, a possibility not
covered by Theorem 5.10. Nevertheless, the needed generalization is readily
available.
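Theorem 5.11 Let $\{Z_{nt}\}$ be a sequence of independent random scalars
with $\mu_{nt} = E(Z_{nt})$, $\sigma_{nt}^2 = \mathrm{var}(Z_{nt})$, and $E|Z_{nt}|^{2+\delta} < \Delta < \infty$ for some
$\delta > 0$ and all $n$ and $t$. Define $\bar Z_n = n^{-1}\sum_{t=1}^n Z_{nt}$, $\bar\mu_n = n^{-1}\sum_{t=1}^n \mu_{nt}$,
and $\bar\sigma_n^2 = \mathrm{var}(\sqrt{n}\bar Z_n) = n^{-1}\sum_{t=1}^n \sigma_{nt}^2$. If $\bar\sigma_n^2 > \delta' > 0$ for all $n$ sufficiently
large, then $\sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n \stackrel{A}{\sim} N(0, 1)$.

Proof. See Loeve (1977, pp. 287-290). ∎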


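Exercise 5.12 Prove the following result. Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(X_t', \varepsilon_t)\}$ is an independent sequence;
(iii) (a) $E(X_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|X_{thi}\varepsilon_{th}|^{2+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, k$, and $t$;
(c) $V_n = \mathrm{var}(n^{-1/2}X'\varepsilon)$ is uniformly positive definite;
(iv) (a) $E|X_{thi}^2|^{1+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, k$, and $t$;
(b) $M_n = E(X'X/n)$ is uniformly positive definite.

Then $D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where $D_n = M_n^{-1}V_nM_n^{-1}$. Suppose
in addition that
(v) there exists $\hat V_n$ symmetric and positive semidefinite such that $\hat V_n - V_n \xrightarrow{p} 0$.
Then $\hat D_n - D_n \xrightarrow{p} 0$, where $\hat D_n = (X'X/n)^{-1}\hat V_n(X'X/n)^{-1}$.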
Note the general applicability of this result. We can let $X_t$ be fixed
or stochastic (although independence is required), and the errors may be 


homoskedastic or heteroskedastic. A similarly general result holds for in- 
strumental variables estimators. 


Theorem 5.13 Given
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is an independent sequence;
(iii) (a) $E(Z_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^{2+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, and $t$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon)$ is uniformly positive definite;
(iv) (a) $E|Z_{thi}X_{thj}|^{1+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, $j = 1, \ldots, k$, and $t$;
(b) $Q_n = E(Z'X/n)$ has uniformly full column rank;
(c) $\hat P_n - P_n \xrightarrow{p} 0$, where $P_n = O(1)$ and is symmetric and uniformly
positive definite.

Then $D_n^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

$$D_n = (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}.$$

Suppose in addition that
(v) there exists $\hat V_n$ symmetric and positive semidefinite such that $\hat V_n - V_n \xrightarrow{p} 0$.
Then $\hat D_n - D_n \xrightarrow{p} 0$, where

$$\hat D_n = (X'Z\hat P_nZ'X/n^2)^{-1}(X'Z/n)\hat P_n\hat V_n\hat P_n(Z'X/n)(X'Z\hat P_nZ'X/n^2)^{-1}.$$


Proof. We verify the conditions of Exercise 4.26. To apply Theorem 5.11,
let $Z_{nt} = \lambda'V_n^{-1/2}Z_t\varepsilon_t$, and consider $n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}Z_t\varepsilon_t$. The summands
$Z_{nt}$ are independent given (ii) with $E(Z_{nt}) = 0$ given (iii.a), $\bar\sigma_n^2 = 1$
given (iii.c), and $E|Z_{nt}|^{2+\delta}$ uniformly bounded (apply Minkowski’s inequality)
given (iii.b). Hence

$$n^{-1/2}\sum_{t=1}^n Z_{nt} = n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}Z_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$$

by Theorem 5.11 and $V_n^{-1/2}n^{-1/2}Z'\varepsilon \stackrel{A}{\sim} N(0, I)$ by the Cramér-Wold device,
Proposition 5.1.

Assumptions (ii), (iv.a), and (iv.b) ensure that $Z'X/n - Q_n \xrightarrow{p} 0$ by
Corollary 3.9 and Theorem 2.24. Since the remaining conditions of Exercise
4.26 are satisfied by assumption, the result follows. ∎
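
The covariance estimator $\hat D_n$ of Theorem 5.13 can be assembled directly from sample moments. The sketch below does so for simulated data with one endogenous regressor and heteroskedastic errors. Two choices are assumptions of the example rather than part of the theorem: $\hat P_n = (Z'Z/n)^{-1}$, which gives 2SLS, and $\hat V_n = n^{-1}\sum_t \tilde\varepsilon_t^2 Z_tZ_t'$, a heteroskedasticity-robust estimator of the kind whose consistency is the subject of the next chapter. The data generating process and names are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
beta0 = np.array([1.0, 0.5])                       # hypothetical beta_o

z = rng.normal(size=(n, 2))                        # excluded instruments
u = rng.normal(size=n)
x = z @ np.array([1.0, -1.0]) + 0.8 * u + rng.normal(size=n)  # endogenous regressor
eps = (1.0 + 0.5 * np.abs(z[:, 0])) * u            # heteroskedastic, E(Z_t eps_t) = 0
X = np.column_stack([np.ones(n), x])               # n x k, rows X_t'
Z = np.column_stack([np.ones(n), z])               # n x l, rows Z_t'
Y = X @ beta0 + eps

P_hat = np.linalg.inv(Z.T @ Z / n)                 # assumed choice: 2SLS weighting
XZ = X.T @ Z / n                                   # X'Z / n
A = XZ @ P_hat @ XZ.T                              # X'Z P_hat Z'X / n^2
b_iv = np.linalg.solve(A, XZ @ P_hat @ (Z.T @ Y / n))

e = Y - X @ b_iv
V_hat = (Z * (e ** 2)[:, None]).T @ Z / n          # assumed robust estimator of V_n
Ainv = np.linalg.inv(A)
D_hat = Ainv @ XZ @ P_hat @ V_hat @ P_hat @ XZ.T @ Ainv
```

The last line reproduces the sandwich form $(X'Z\hat P_nZ'X/n^2)^{-1}(X'Z/n)\hat P_n\hat V_n\hat P_n(Z'X/n)(X'Z\hat P_nZ'X/n^2)^{-1}$ term by term.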


Note the close similarity of the present result to that of Exercise 5.4. 
We have dropped the identical distribution assumption made there at the 
expense of imposing just slightly more in the way of moment requirements 
in (iii.b) and (iv.a). Otherwise, the conditions are identical. This relatively 
minor trade-off has greatly increased the applicability of the results. Not 
only do the present results apply to situations with fixed regressors and 
either homoskedastic or heteroskedastic disturbances, but they also apply 
to cross-sectional data with either homoskedastic or heteroskedastic
disturbances. Further, by setting $1 < p < \infty$, the present results apply to
panel data (i.e., time-series cross-sectional data) when $p$ observations are
available for each individual.

As previously discussed, the independence assumption is not generally 
appropriate in time-series applications, so we now turn to central limit 
results applicable to time-series data. 


5.3 Dependent Identically Distributed 
Observations 


In the last two sections we saw that obtaining central limit theorems for 
independent processes typically required strengthening the moment restric- 
tions beyond what was sufficient for obtaining laws of large numbers. In 
the case of stationary ergodic processes, not only will we strengthen the 
moment requirements, but we will also impose stronger conditions on the 
memory of the process. 

To motivate the memory conditions that we add, consider a random
scalar $Z_t$, and let $\mathcal{F}_t$ be a $\sigma$-algebra such that $\{Z_t, \mathcal{F}_t\}$ is an adapted
stochastic sequence ($Z_t$ is measurable with respect to $\mathcal{F}_t$ and $\mathcal{F}_{t-1} \subset \mathcal{F}_t \subset \mathcal{F}$).
We can think of $\mathcal{F}_t$ as being the $\sigma$-algebra generated by the entire
current and past history of $Z_t$ or, more generally, as the $\sigma$-algebra generated
by the entire current and past history of $Z_t$ as well as other random
variables, say $Y_t$. Given $E|Z_t| < \infty$, we can write

$$Z_t = Z_t - E(Z_t \mid \mathcal{F}_{t-1}) + E(Z_t \mid \mathcal{F}_{t-1}).$$

Similarly,

$$Z_t = Z_t - E(Z_t \mid \mathcal{F}_{t-1}) + E(Z_t \mid \mathcal{F}_{t-1}) - E(Z_t \mid \mathcal{F}_{t-2}) + E(Z_t \mid \mathcal{F}_{t-2}).$$

Proceeding in this way we can write

$$Z_t = \sum_{j=0}^{m-1} R_{tj} + E(Z_t \mid \mathcal{F}_{t-m}), \qquad m = 1, 2, \ldots,$$

where $R_{tj}$ is the revision made in forecasting $Z_t$ when information becomes
available at time $t - j$:

$$R_{tj} = E(Z_t \mid \mathcal{F}_{t-j}) - E(Z_t \mid \mathcal{F}_{t-j-1}).$$
Note that for fixed $j$, $\{R_{tj}, \mathcal{F}_{t-j}\}$ is a martingale difference sequence, because
it is an adapted stochastic sequence and

$$E(R_{tj} \mid \mathcal{F}_{t-j-1}) = E[E(Z_t \mid \mathcal{F}_{t-j}) - E(Z_t \mid \mathcal{F}_{t-j-1}) \mid \mathcal{F}_{t-j-1}]$$
$$= E[E(Z_t \mid \mathcal{F}_{t-j}) \mid \mathcal{F}_{t-j-1}] - E[E(Z_t \mid \mathcal{F}_{t-j-1}) \mid \mathcal{F}_{t-j-1}]$$
$$= E(Z_t \mid \mathcal{F}_{t-j-1}) - E(Z_t \mid \mathcal{F}_{t-j-1}) = 0,$$

where we have applied the linearity property and the law of iterated
expectations, Proposition 3.71.

Thus we have written $Z_t$ as a sum of martingale differences plus a remainder.
The validity of the central limit theorem we discuss rests on the
ability to write

$$Z_t = \sum_{j=0}^{\infty} R_{tj}.$$

In this form, $Z_t$ is expressed as a “telescoping sum,” because adjacent
elements of $R_{tj}$ cancel out. Among other things, the validity of this expression
requires that $E(Z_t \mid \mathcal{F}_{t-m})$ tend appropriately to zero as $m \to \infty$. Remember
that $E(Z_t \mid \mathcal{F}_{t-m})$ is a random variable, so the convergence to zero must
be stochastic. In fact, the condition we impose will imply that

$$E([E(Z_t \mid \mathcal{F}_{t-m})]^2) \to 0 \text{ as } m \to \infty,$$

which can be stated in terms of convergence in quadratic mean as defined
in Chapter 2, i.e.,

$$E(Z_t \mid \mathcal{F}_{t-m}) \xrightarrow{q.m.} 0 \text{ as } m \to \infty.$$

One way of interpreting this condition is that as we forecast $Z_t$ based
only on the information available at more and more distant points in the
past, our forecast approaches zero (in a mean squared error sense). Further,
this condition actually implies that $E(Z_t) = 0$ as we prove next, so that
as our forecast becomes based on less and less information, it approaches
the forecast we would make with no information, i.e., the unconditional
expectation $E(Z_t)$.
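
The telescoping decomposition can be verified numerically in a case where the conditional expectations are available in closed form. As an illustrative assumption (not an example from the text), take a zero-mean AR(1) process $Z_t = \rho Z_{t-1} + u_t$ with $\mathcal{F}_t$ generated by the history of $Z_t$, so that $E(Z_t \mid \mathcal{F}_{t-j}) = \rho^j Z_{t-j}$ and the revisions are $R_{tj} = \rho^j u_{t-j}$.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, T, m = 0.6, 200, 25                 # hypothetical AR coefficient, length, horizon

u = rng.normal(size=T)                   # martingale difference innovations
Z = np.zeros(T)
for t in range(1, T):
    Z[t] = rho * Z[t - 1] + u[t]

def cond_exp(t, j):
    """E(Z_t | F_{t-j}) = rho**j * Z_{t-j} for this AR(1) process."""
    return rho ** j * Z[t - j]

t = T - 1
# revisions R_tj = E(Z_t | F_{t-j}) - E(Z_t | F_{t-j-1}), j = 0, ..., m-1
R = np.array([cond_exp(t, j) - cond_exp(t, j + 1) for j in range(m)])
remainder = cond_exp(t, m)               # E(Z_t | F_{t-m})
```

Summing the $m$ revisions and adding the remainder recovers $Z_t$ exactly, and the remainder is of order $\rho^m$, which is how the requirement $E(Z_t \mid \mathcal{F}_{t-m}) \xrightarrow{q.m.} 0$ shows up in this special case.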


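Lemma 5.14 Let $\{Z_t, \mathcal{F}_t\}$ be an adapted stochastic sequence such that
$E(Z_t^2) < \infty$, $t = 1, 2, \ldots$, and suppose $E(Z_t \mid \mathcal{F}_{t-m}) \xrightarrow{q.m.} 0$ as $m \to \infty$.
Then $E(Z_t) = 0$.

Proof. By Theorem 2.40, $E(Z_t \mid \mathcal{F}_{t-m}) \xrightarrow{q.m.} 0$ as $m \to \infty$ implies that
$E(|E(Z_t \mid \mathcal{F}_{t-m})|) \to 0$ as $m \to \infty$. Hence, for every $\varepsilon > 0$ there exists
$M(\varepsilon)$ such that $0 \le E(|E(Z_t \mid \mathcal{F}_{t-m})|) < \varepsilon$ for all $m > M(\varepsilon)$. By Jensen’s
inequality, $|E[E(Z_t \mid \mathcal{F}_{t-m})]| \le E(|E(Z_t \mid \mathcal{F}_{t-m})|)$, so $0 \le |E[E(Z_t \mid \mathcal{F}_{t-m})]| < \varepsilon$
for all $m > M(\varepsilon)$. But by the law of iterated expectations, $E(Z_t) =
E[E(Z_t \mid \mathcal{F}_{t-m})]$, so $0 \le |E(Z_t)| < \varepsilon$. Since $\varepsilon$ is arbitrary, it follows that
$E(Z_t) = 0$. ∎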

In establishing the central limit result, it is necessary to have $\bar\sigma_n^2 =
\mathrm{var}(\sqrt{n}\bar Z_n)$ finite. However, for this it does not suffice simply to have $\sigma^2 =
\mathrm{var}(Z_t)$ finite. Inspecting $\bar\sigma_n^2$, we see that

$$\bar\sigma_n^2 = n\,\mathrm{var}(\bar Z_n) = nE\Big[\Big(n^{-1}\sum_{t=1}^n Z_t\Big)^2\Big]$$
$$= n^{-1}\sum_{t=1}^n E(Z_t^2) + 2n^{-1}\sum_{\tau=1}^{n-1}\sum_{t=\tau+1}^n E(Z_tZ_{t-\tau}).$$

When $Z_t$ is stationary, $\rho_\tau = E(Z_tZ_{t-\tau})/\sigma^2$ does not depend on $t$. Hence,

$$\bar\sigma_n^2 = \sigma^2 + 2\sigma^2n^{-1}\sum_{\tau=1}^{n-1}(n - \tau)\rho_\tau$$
$$= \sigma^2 + 2\sigma^2\sum_{\tau=1}^{n-1}\rho_\tau(1 - \tau/n).$$

This last term contains a growing number of terms as $n \to \infty$, and without
further conditions is not guaranteed to converge. It turns out that the
condition

$$\sum_{m=0}^{\infty}\big(E[E(Z_0 \mid \mathcal{F}_{-m})^2]\big)^{1/2} < \infty$$

is sufficient to ensure that $\rho_\tau$ declines fast enough that $\bar\sigma_n^2$ converges to a
finite limit, say, $\bar\sigma^2$, as $n \to \infty$ and that this, together with stationarity and
ergodicity, provides enough structure to obtain a central limit result.
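
The formula $\bar\sigma_n^2 = \sigma^2 + 2\sigma^2\sum_{\tau=1}^{n-1}\rho_\tau(1 - \tau/n)$ can be evaluated directly when the autocorrelations are known. The sketch below assumes, purely for illustration, a stationary AR(1) process with $\rho_\tau = \rho^\tau$, for which the limit is $\bar\sigma^2 = \sigma^2(1 + \rho)/(1 - \rho)$; the function and variable names are hypothetical.

```python
import numpy as np

def sigma_bar_sq(n, rho, sigma2):
    """sigma^2 + 2 sigma^2 sum_{tau=1}^{n-1} rho_tau (1 - tau/n), with rho_tau = rho**tau."""
    tau = np.arange(1, n)
    return sigma2 * (1.0 + 2.0 * np.sum(rho ** tau * (1.0 - tau / n)))

rho = 0.6
sigma2 = 1.0 / (1.0 - rho ** 2)          # var(Z_t) for unit innovation variance
limit = sigma2 * (1.0 + rho) / (1.0 - rho)
values = {n: sigma_bar_sq(n, rho, sigma2) for n in (10, 100, 10000)}

# Cross-check for n = 100: var(n^{-1/2} sum Z_t) from the full covariance matrix
n = 100
idx = np.arange(n)
cov = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
direct = cov.sum() / n
```

The values increase with $n$ toward `limit`, and `direct` agrees with `values[100]`, confirming that the double-sum and single-sum expressions for $\bar\sigma_n^2$ coincide.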

In order to give a convenient statement of an ergodic central limit theorem,
we introduce the notion of an adapted mixingale process.


Definition 5.15 Let $\{Z_t, \mathcal{F}_t\}$ be an adapted stochastic sequence with $E(Z_t^2)
< \infty$. Then $\{Z_t, \mathcal{F}_t\}$ is an adapted mixingale if there exist finite nonnegative
sequences $\{c_t\}$ and $\{\gamma_m\}$ such that $\gamma_m \to 0$ as $m \to \infty$ and

$$\big(E(E(Z_t \mid \mathcal{F}_{t-m})^2)\big)^{1/2} \le c_t\gamma_m.$$

We say $\gamma_m$ is of size $-a$ if $\gamma_m = O(m^{-a-\varepsilon})$ for some $\varepsilon > 0$.


The notion of a mixingale is due to McLeish (1974). Note that in this
definition $\{Z_t\}$ need not be stationary or ergodic, but may be heterogeneous.
In the mixingale definition of McLeish, $\mathcal{F}_t$ need not be adapted to
$Z_t$. Nevertheless, we impose this because it simplifies matters nicely and
is sufficient for all our applications. As the name is intended to suggest,
mixingale processes have attributes of both mixing processes and martingale
difference processes. They can be thought of as processes that behave
“asymptotically” like martingale difference processes, analogous to mixing
processes, which behave “asymptotically” like independent processes.
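
As a simple illustration (an example added here under stated assumptions, not taken from the original text), suppose $Z_t = \phi Z_{t-1} + v_t$ is stationary with $|\phi| < 1$, where $\{v_t\}$ is i.i.d. with mean zero and finite variance, and let $\mathcal{F}_t = \sigma(\ldots, Z_{t-1}, Z_t)$. Writing $\sigma^2 = E(Z_t^2)$, repeated substitution gives

\[
E(Z_t|\mathcal{F}_{t-m}) = \phi^m Z_{t-m},
\qquad
\big(E\big[E(Z_t|\mathcal{F}_{t-m})^2\big]\big)^{1/2} = |\phi|^m \sigma ,
\]

so the mixingale inequality holds with $c_t = \sigma$ and $\gamma_m = |\phi|^m$. Because $|\phi|^m = O(m^{-a-\varepsilon})$ for every $a > 0$, this $\gamma_m$ is of size $-a$ for every $a$, and in particular of size $-1$. When $\phi = 0$ the conditional expectation is zero for all $m \ge 1$, which is the martingale difference case.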


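Theorem 5.16 (Scott) Let $\{Z_t, \mathcal{F}_t\}$ be a stationary ergodic adapted mixingale
with $\gamma_m$ of size $-1$. Then $\bar\sigma_n^2 = \mathrm{var}(n^{-1/2}\sum_{t=1}^{n} Z_t) \to \bar\sigma^2 < \infty$ as
$n \to \infty$, and if $\bar\sigma^2 > 0$, then $\sqrt{n}\,\bar Z_n/\bar\sigma \stackrel{A}{\sim} N(0,1)$.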


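Proof. Under the conditions given, the result follows from Theorem 3 of
Scott (1973), provided that

\[
\sum_{m=1}^{\infty} \Big\{ \big(E[E(Z_0|\mathcal{F}_{-m})^2]\big)^{1/2}
 + \big(E[Z_0 - E(Z_0|\mathcal{F}_{m})]^2\big)^{1/2} \Big\} < \infty .
\]

Because $Z_0$ is $\mathcal{F}_m$-measurable for each $m \ge 1$, $E(Z_0|\mathcal{F}_m) = Z_0$ and the
second term in the summation vanishes. It suffices then that

\[
\sum_{m=1}^{\infty} \big(E[E(Z_0|\mathcal{F}_{-m})^2]\big)^{1/2} < \infty .
\]

Applying the mixingale and stationarity conditions we have

\[
\sum_{m=1}^{\infty} \big(E[E(Z_0|\mathcal{F}_{-m})^2]\big)^{1/2}
 \le c_0 \sum_{m=1}^{\infty} \gamma_m
 \le c_0 \Delta \sum_{m=1}^{\infty} m^{-(1+\varepsilon)}
 < \infty ,
\]

with $\Delta < \infty$, as $\gamma_m$ is of size $-1$. ∎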


A related but more general result is given by Heyde (1975), where the 
interested reader can find further details of the underlying mathematics. 

Applying Theorem 5.16 and Proposition 5.1 we obtain the following result
for the OLS estimator.

Theorem 5.17 Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;
(iii) (a) $\{X_{thi}\varepsilon_{th}, \mathcal{F}_t\}$ is an adapted mixingale of size $-1$, $h = 1, \ldots, p$,
$i = 1, \ldots, k$;
(b) $E|X_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$;
(c) $V_n = \mathrm{var}(n^{-1/2}X'\varepsilon)$ is uniformly positive definite;
(iv) (a) $E|X_{thi}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$;
(b) $M = E(X_t X_t')$ is positive definite.

Then $V_n \to V$ finite and positive definite as $n \to \infty$, and
$D^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where $D = M^{-1} V M^{-1}$.
Suppose in addition that

(v) there exists $\hat V_n$ symmetric and positive semidefinite such that
$\hat V_n - V_n \xrightarrow{p} 0$.

Then $\hat D_n - D \xrightarrow{p} 0$, where $\hat D_n = (X'X/n)^{-1}\hat V_n (X'X/n)^{-1}$.


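Proof. We verify the conditions of Theorem 4.25. First we apply Theorem
5.16 and Proposition 5.1 to show that $V_n^{-1/2} n^{-1/2} X'\varepsilon \stackrel{A}{\sim} N(0, I)$. Consider
$n^{-1/2}\sum_{t=1}^{n} \lambda' V^{-1/2} X_t\varepsilon_t$, where $V$ is any finite positive definite matrix.
By Theorem 3.35, $\{Z_t = \lambda' V^{-1/2} X_t \varepsilon_t\}$ is a stationary ergodic sequence
given (ii), and $\{Z_t, \mathcal{F}_t\}$ is an adapted stochastic sequence because $Z_t$ is
measurable with respect to $\mathcal{F}_t$ by Proposition 3.23, and $\mathcal{F}_{t-1} \subset \mathcal{F}_t \subset \mathcal{F}$.
To see that $E(Z_t^2) < \infty$, note that we can write

\[
Z_t = \lambda' V^{-1/2} X_t \varepsilon_t
 = \sum_{h=1}^{p} \lambda' V^{-1/2} X_{th}\varepsilon_{th}
 = \sum_{h=1}^{p}\sum_{i=1}^{k} \tilde\lambda_i X_{thi}\varepsilon_{th},
\]

where $\tilde\lambda_i$ is the $i$th element of the $k \times 1$ vector $\tilde\lambda = V^{-1/2}\lambda$. By definition
of $\lambda$ and $V$, there exists $\Delta < \infty$ such that $|\tilde\lambda_i| < \Delta$ for all $i$. It follows from
Minkowski's inequality that

\[
\begin{aligned}
E(Z_t^2) &\le \Big[\sum_{h=1}^{p}\sum_{i=1}^{k} \big(E|\tilde\lambda_i X_{thi}\varepsilon_{th}|^2\big)^{1/2}\Big]^2 \\
 &\le \Big[\Delta \sum_{h=1}^{p}\sum_{i=1}^{k} \big(E|X_{thi}\varepsilon_{th}|^2\big)^{1/2}\Big]^2 \\
 &\le [\Delta p k \Delta^{1/2}]^2 < \infty,
\end{aligned}
\]

since for $\Delta$ sufficiently large, $E|X_{thi}\varepsilon_{th}|^2 < \Delta < \infty$ given (iii.b) and the
stationarity assumption. Next, we show $\{Z_t, \mathcal{F}_t\}$ is a mixingale of size $-1$.
Using the expression for $Z_t$ just given, we can write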


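\[
\begin{aligned}
E\big([E(Z_0|\mathcal{F}_{-m})]^2\big)
 &= E\Big(\Big[E\Big(\sum_{h=1}^{p}\sum_{i=1}^{k}\tilde\lambda_i X_{0hi}\varepsilon_{0h}\,\Big|\,\mathcal{F}_{-m}\Big)\Big]^2\Big) \\
 &= E\Big(\Big[\sum_{h=1}^{p}\sum_{i=1}^{k}\tilde\lambda_i E(X_{0hi}\varepsilon_{0h}|\mathcal{F}_{-m})\Big]^2\Big).
\end{aligned}
\]

Applying Minkowski's inequality it follows that

\[
\begin{aligned}
E\big([E(Z_0|\mathcal{F}_{-m})]^2\big)
 &\le \Big[\sum_{h=1}^{p}\sum_{i=1}^{k}\big(E[\tilde\lambda_i E(X_{0hi}\varepsilon_{0h}|\mathcal{F}_{-m})]^2\big)^{1/2}\Big]^2 \\
 &\le \Big[\Delta\sum_{h=1}^{p}\sum_{i=1}^{k}\big(E[E(X_{0hi}\varepsilon_{0h}|\mathcal{F}_{-m})^2]\big)^{1/2}\Big]^2 \\
 &\le \Big[\Delta\sum_{h=1}^{p}\sum_{i=1}^{k} c_{0hi}\gamma_{mhi}\Big]^2 \\
 &\le [\Delta p k\, \bar c_0 \bar\gamma_m]^2 ,
\end{aligned}
\]

where $\bar c_0 = \max_{h,i} c_{0hi} < \infty$ and $\bar\gamma_m = \max_{h,i}\gamma_{mhi}$ is of size $-1$. Thus,
$\{Z_t, \mathcal{F}_t\}$ is a mixingale of size $-1$ as required.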
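By Theorem 5.16, it follows that

\[
\mathrm{var}(\sqrt{n}\,\bar Z_n) = \mathrm{var}\Big(n^{-1/2}\sum_{t=1}^{n}\lambda' V^{-1/2} X_t\varepsilon_t\Big)
 = \lambda' V^{-1/2} V_n V^{-1/2}\lambda \to \bar\sigma^2 < \infty .
\]

Hence $V_n$ converges to a finite matrix. Now set $V = \lim_{n\to\infty} V_n$, which
is positive definite given (iii.c). Then $\bar\sigma^2 = \lambda' V^{-1/2} V V^{-1/2}\lambda = 1$. It then
follows from Theorem 5.16 that $n^{-1/2}\sum_{t=1}^{n}\lambda' V^{-1/2} X_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$. Since
this holds for every $\lambda$ such that $\lambda'\lambda = 1$, it follows from Proposition 5.1
that $V^{-1/2} n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t \stackrel{A}{\sim} N(0, I)$. Now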


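\[
\begin{aligned}
V_n^{-1/2} n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t - V^{-1/2} n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t
 &= (V_n^{-1/2} V^{1/2} - I)\, V^{-1/2} n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t \\
 &\xrightarrow{p} 0,
\end{aligned}
\]

since $V_n^{-1/2} V^{1/2} - I$ is $o(1)$ by Definition 2.5 and

\[
V^{-1/2} n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t \stackrel{A}{\sim} N(0, I),
\]

which allows application of Lemma 4.6. Hence by Lemma 4.7,
$V_n^{-1/2} n^{-1/2} X'\varepsilon \stackrel{A}{\sim} N(0, I)$.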


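Next, $X'X/n - M \xrightarrow{a.s.} 0$ by the ergodic theorem (Theorem 3.34) and
Theorem 2.24 given (ii) and (iv), where $M$ is finite and positive definite. Since
the conditions of Theorem 4.25 are satisfied, it follows that
$D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where $D_n = M^{-1} V_n M^{-1}$. Because $D_n - D \to 0$ as $n \to \infty$,
it follows that

\[
D^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) - D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o)
 = (D^{-1/2} D_n^{1/2} - I)\, D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \xrightarrow{p} 0
\]

by Lemma 4.6. Hence, by Lemma 4.7,

\[
D^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I).
\]

∎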

Comparing this result with the OLS result in Theorem 5.3 for i.i.d. regressors,
we have replaced the i.i.d. assumption with stationarity, ergodicity,
and the mixingale requirements of (iii.a). Because these conditions are always
satisfied for i.i.d. sequences, Theorem 5.3 is in fact a direct corollary
of Theorem 5.16. (Condition (iii.a) is satisfied because for i.i.d. sequences
$E(X_{0hi}\varepsilon_{0h}|\mathcal{F}_{-m}) = 0$ for all $m > 0$.) Note that these conditions impose
the restrictions placed on $Z_t$ in Theorem 5.16 for each regressor-error cross
product $X_{thi}\varepsilon_{th}$.

Although the present result now allows for the possibility that $X_t$ contains
lagged dependent variables $Y_{t-1}, Y_{t-2}, \ldots$, it does not allow $\varepsilon_t$ to
be serially correlated at the same time. This is ruled out by (iii.a), which
implies $E(X_t\varepsilon_t) = 0$ by Lemma 5.14. This condition will be violated if
lagged dependent variables are present when $\varepsilon_t$ is serially correlated. Also
note that if lagged dependent variables are present in $X_t$, condition (iv.a)
requires that $E(Y_t^2)$ is finite. This in turn places restrictions on the possible
values allowed for $\beta_o$.


Exercise 5.18 Suppose that the data are generated as $Y_t = \beta_{o1} Y_{t-1} +
\beta_{o2} Y_{t-2} + \varepsilon_t$. State general conditions on $\{Y_t\}$ and $(\beta_{o1}, \beta_{o2})$ which ensure
the consistency and asymptotic normality of the OLS estimator for $\beta_{o1}$ and
$\beta_{o2}$.


As just mentioned, OLS is inappropriate when the model contains lagged 
dependent variables in the presence of serially correlated errors. However, 
useful instrumental variables estimators are often available. 


Exercise 5.19 Prove the following result. Given
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;
(iii) (a) $\{Z_{thi}\varepsilon_{th}, \mathcal{F}_t\}$ is an adapted mixingale of size $-1$, $h = 1, \ldots, p$,
$i = 1, \ldots, l$;
(b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon)$ is uniformly positive definite;
(iv) (a) $E|Z_{thi}X_{thj}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, and $j = 1, \ldots, k$;
(b) $Q = E(Z_t X_t')$ has full column rank;
(c) $\hat P_n \xrightarrow{p} P$ finite, symmetric, and positive definite.
Then $V_n \to V$ finite and positive definite as $n \to \infty$, and
$D^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

\[
D = (Q'PQ)^{-1} Q'PVPQ (Q'PQ)^{-1}.
\]

Suppose further that

(v) there exists $\hat V_n$ symmetric and positive semidefinite such that
$\hat V_n - V \xrightarrow{p} 0$.

Then $\hat D_n - D \xrightarrow{p} 0$, where

\[
\hat D_n = (X'Z\hat P_n Z'X/n^2)^{-1}(X'Z/n)\hat P_n \hat V_n \hat P_n (Z'X/n)(X'Z\hat P_n Z'X/n^2)^{-1}.
\]


This result follows as a corollary to a more general theorem for nonlinear 
equations given by Hansen (1982). However, all the essential features of his 
assumptions are illustrated in the present result. 

Since the results of this section are based on a stationarity assumption, 
unconditional heteroskedasticity is explicitly ruled out. However, condi- 
tional heteroskedasticity is nevertheless a possibility, so efficiency improve- 
ments along the lines of Theorem 4.55 may be obtained by accounting for 
conditional heteroskedasticity. 


5.4 Dependent Heterogeneously Distributed 
Observations 


To allow for situations in which the errors exhibit unconditional heteroske- 
dasticity, or the explanatory variables contain fixed as well as lagged de- 
pendent variables, we apply central limit results for sequences of mixing 
random variables. A convenient version of the Liapounov theorem for mix- 
ing processes is the following. 


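Theorem 5.20 (Wooldridge-White) Let $\{Z_{nt}\}$ be a double array of
scalars with $\mu_{nt} = E(Z_{nt}) = 0$ and $\sigma_{nt}^2 = \mathrm{var}(Z_{nt})$ such that $E|Z_{nt}|^r <
\Delta < \infty$ for some $r > 2$, and all $n$ and $t$, and having mixing coefficients
$\phi$ of size $-r/2(r-1)$, or $\alpha$ of size $-r/(r-2)$, $r > 2$. If $\bar\sigma_n^2 =
\mathrm{var}(n^{-1/2}\sum_{t=1}^{n} Z_{nt}) > \delta > 0$ for all $n$ sufficiently large, then
$\sqrt{n}(\bar Z_n - \bar\mu_n)/\bar\sigma_n \stackrel{A}{\sim} N(0, 1)$.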


Proof. The result follows from Corollary 3.1 of Wooldridge and White
(1988) applied to the random variables $\tilde Z_{nt} = Z_{nt}/\bar\sigma_n$. See also Wooldridge
(1986, Ch. 3, Corollary 4.4). ∎


Compared with Theorem 5.11, the moment requirements are now potentially
stronger to allow for considerably more dependence in $Z_{nt}$. Note,
however, that if $\phi(m)$ or $\alpha(m)$ decrease exponentially in $m$, we can set $r$
arbitrarily close to two, implying essentially the same moment restrictions
as in the independent case.

The analog to Exercise 5.12 is as follows.

Exercise 5.21 Prove the following result. Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/2(r-1)$, $r > 2$,
or $\alpha$ of size $-r/(r-2)$, $r > 2$;

(iii) (a) $E(X_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|X_{thi}\varepsilon_{th}|^r < \Delta < \infty$ for $h = 1, \ldots, p$, $i = 1, \ldots, k$, and all $t$;
(c) $V_n = \mathrm{var}(n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t)$ is uniformly positive definite;

(iv) (a) $E|X_{thi}^2|^{(r/2)+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, k$, and $t$;
(b) $M_n = E(X'X/n)$ is uniformly positive definite.


Then $D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where $D_n = M_n^{-1} V_n M_n^{-1}$. Suppose
in addition that


(v) there exists $\hat V_n$ symmetric and positive semidefinite such that
$\hat V_n - V_n \xrightarrow{p} 0$.


Then $\hat D_n - D_n \xrightarrow{p} 0$, where $\hat D_n = (X'X/n)^{-1}\hat V_n (X'X/n)^{-1}$.


Compared with Exercise 5.12, we have relaxed the memory requirement 
from independence to mixing (asymptotic independence). Depending on 
the amount of dependence the observations exhibit, the moment conditions 
may or may not be stronger than those of Exercise 5.12. 

The flexibility gained by dispensing with the stationarity assumption of 
Theorem 5.17 is that the present result can accommodate the inclusion of 
fixed regressors as well as lagged dependent variables in the explanatory 
variables of the model. The price paid is an increase in the moment restric- 
tions, as well as an increase in the strength of the memory conditions. 


Exercise 5.22 Suppose the data are generated as $Y_t = \beta_{o1} Y_{t-1} + \beta_{o2} W_t +
\varepsilon_t$, where $W_t$ is a fixed scalar. Let $X_t = (Y_{t-1}, W_t)'$ and provide conditions
on $\{(X_t', \varepsilon_t)\}$ and $(\beta_{o1}, \beta_{o2})$ that ensure that the OLS estimator of $\beta_{o1}$ and
$\beta_{o2}$ is consistent and asymptotically normal.


The result for the instrumental variables estimator is the following.


Theorem 5.23 Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/2(r-1)$,
$r > 2$, or $\alpha$ of size $-r/(r-2)$, $r > 2$;

(iii) (a) $E(Z_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^r < \Delta < \infty$ for $h = 1, \ldots, p$, $i = 1, \ldots, l$, and all $t$;
(c) $V_n = \mathrm{var}(n^{-1/2}\sum_{t=1}^{n} Z_t\varepsilon_t)$ is uniformly positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^{(r/2)+\delta} < \Delta < \infty$ for $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, $j = 1, \ldots, k$, and $t$;
(b) $Q_n = E(Z'X/n)$ has uniformly full column rank;
(c) $\hat P_n - P_n \xrightarrow{p} 0$, where $P_n = O(1)$ and is symmetric and uniformly
positive definite.


Then $D_n^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

\[
D_n = (Q_n'P_nQ_n)^{-1} Q_n'P_nV_nP_nQ_n (Q_n'P_nQ_n)^{-1}.
\]

Suppose further that

(v) there exists $\hat V_n$ symmetric and positive semidefinite such that
$\hat V_n - V_n \xrightarrow{p} 0$.

Then $\hat D_n - D_n \xrightarrow{p} 0$, where

\[
\hat D_n = (X'Z\hat P_n Z'X/n^2)^{-1}(X'Z/n)\hat P_n \hat V_n \hat P_n (Z'X/n)(X'Z\hat P_n Z'X/n^2)^{-1}.
\]


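Proof. We verify that the conditions of Exercise 4.26 hold. First we apply
Proposition 5.1 to show $V_n^{-1/2} n^{-1/2} Z'\varepsilon \stackrel{A}{\sim} N(0, I)$. Consider

\[
n^{-1/2}\sum_{t=1}^{n} \lambda' V_n^{-1/2} Z_t\varepsilon_t .
\]

By Theorem 3.49, $\lambda' V_n^{-1/2} Z_t\varepsilon_t$ is a sequence of mixing random variables
with either $\phi$ of size $-r/2(r-1)$, $r > 2$, or $\alpha$ of size $-r/(r-2)$, $r > 2$, given
(ii). Further, $E(\lambda' V_n^{-1/2} Z_t\varepsilon_t) = 0$ given (iii.a), $E|\lambda' V_n^{-1/2} Z_t\varepsilon_t|^r < \Delta <
\infty$ for all $t$ given (iii.b), and

\[
\bar\sigma_n^2 = \mathrm{var}\Big(n^{-1/2}\sum_{t=1}^{n}\lambda' V_n^{-1/2} Z_t\varepsilon_t\Big)
 = \lambda' V_n^{-1/2} V_n V_n^{-1/2}\lambda = 1 .
\]

It follows from Theorem 5.20 that $n^{-1/2}\sum_{t=1}^{n}\lambda' V_n^{-1/2} Z_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$.
Since this holds for every $\lambda$, $\lambda'\lambda = 1$, it follows from Proposition 5.1 that

\[
V_n^{-1/2} n^{-1/2}\sum_{t=1}^{n} Z_t\varepsilon_t \stackrel{A}{\sim} N(0, I).
\]

Next, $Z'X/n - Q_n \xrightarrow{p} 0$ by Corollary 3.48 given (iv.a), where $\{Q_n\}$ is
$O(1)$ and has uniformly full column rank by (iv.a) and (iv.b). Since (iv.c)
also holds, the desired result now follows from Exercise 4.26. ∎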


This result is in a sense the most general of all the results that we have 
obtained, because it contains so many special cases. Specifically, it covers 
every situation previously considered (i.i.d., i.h.d., and d.i.d. observations), 
although at the explicit cost of imposing slightly stronger conditions in 
various respects. Note, too, this result applies to systems of equations or 
panel data since we can choose p > 1. 


5.5 Martingale Difference Sequences 


In Chapter 3 we discussed laws of large numbers for martingale difference
sequences and mentioned that economic theory is often used to justify the
martingale difference assumption. If the martingale difference assumption
is valid, then it often allows us to simplify or weaken some of the other
conditions imposed in establishing the asymptotic normality of our estimators.

There are a variety of central limit theorems available for martingale
difference sequences. One version that is relatively convenient is an extension
of the Lindeberg-Feller theorem (Theorem 5.6). In stating it, we
consider sequences of random variables $\{Z_{nt}\}$ and associated $\sigma$-algebras
$\{\mathcal{F}_{nt}, 1 \le t \le n\}$, where $\mathcal{F}_{n,t-1} \subset \mathcal{F}_{nt}$ and $Z_{nt}$ is measurable with respect
to $\mathcal{F}_{nt}$. We can think of $\mathcal{F}_{nt}$ as being the $\sigma$-field generated by the current
and past of $Z_{nt}$ as well as any other relevant random variables.


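Theorem 5.24 Let $\{Z_{nt}, \mathcal{F}_{nt}\}$ be a martingale difference sequence such
that $\sigma_{nt}^2 = E(Z_{nt}^2) < \infty$, $\sigma_{nt}^2 \ne 0$, and let $F_{nt}$ be the distribution function
of $Z_{nt}$. Define $\bar Z_n = n^{-1}\sum_{t=1}^{n} Z_{nt}$ and $\bar\sigma_n^2 = \mathrm{var}(\sqrt{n}\,\bar Z_n) = n^{-1}\sum_{t=1}^{n}\sigma_{nt}^2$.
If for every $\varepsilon > 0$

\[
\lim_{n\to\infty} \bar\sigma_n^{-2} n^{-1}\sum_{t=1}^{n} \int_{z^2 > \varepsilon n \bar\sigma_n^2} z^2\, dF_{nt}(z) = 0,
\]

and

\[
n^{-1}\sum_{t=1}^{n} Z_{nt}^2/\bar\sigma_n^2 - 1 \xrightarrow{p} 0,
\]

then $\sqrt{n}\,\bar Z_n/\bar\sigma_n \stackrel{A}{\sim} N(0, 1)$.

Proof. This follows immediately as a corollary to Theorem 2.3 of McLeish
(1974). ∎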

Comparing this result with the Lindeberg-Feller theorem, we see that
both impose the Lindeberg condition, whereas the independence assumption
has here been weakened to the martingale difference assumption. The
present result also imposes a condition not explicit in the Lindeberg-Feller
theorem, i.e., essentially that the sample variance $n^{-1}\sum_{t=1}^{n} Z_{nt}^2$ is a consistent
estimator for $\bar\sigma_n^2$. This condition is unnecessary in the independent
case because it is implied there by the Lindeberg condition. Without independence,
we make use of additional conditions, e.g., stationarity and
ergodicity or mixing, to ensure that the sample variance is indeed consistent
for $\bar\sigma_n^2$.

To illustrate how use of the martingale difference assumption allows us
to simplify our results, consider the IV estimator in the case of stationary
observations. We have the following result.


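Theorem 5.25 Suppose conditions (i), (ii), (iv), and (v) of Exercise 5.19
hold, and replace condition (iii) with
(iii') (a) $E(Z_{thi}\varepsilon_{th}|\mathcal{F}_{t-1}) = 0$ for all $t$, where $\{\mathcal{F}_t\}$ is adapted to $\{Z_{thi}\varepsilon_{th}\}$,
$h = 1, \ldots, p$, $i = 1, \ldots, l$;
(b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon) = \mathrm{var}(Z_t\varepsilon_t) \equiv V$ is nonsingular.
Then the conclusions of Exercise 5.19 hold.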


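Proof. One way to prove this is to show that (iii') implies (iii). This is
direct, and it is left to the reader to verify.
Alternatively, we can apply Proposition 5.1 and Theorem 5.24 to verify
that $V_n^{-1/2} n^{-1/2} Z'\varepsilon \stackrel{A}{\sim} N(0, I)$. Since $\{Z_t\varepsilon_t\}$ is a stationary martingale
difference sequence,

\[
\mathrm{var}(n^{-1/2}Z'\varepsilon) = n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t') = V,
\]

finite by (iii'.b) and positive definite by (iii'.c). Hence, consider

\[
n^{-1/2}\sum_{t=1}^{n} \lambda' V^{-1/2} Z_t\varepsilon_t .
\]

By Proposition 3.23, $\lambda' V^{-1/2} Z_t\varepsilon_t$ is measurable with respect to $\mathcal{F}_t$ given
(iii'.a). Writing $\lambda' V^{-1/2} Z_t\varepsilon_t = \sum_{h=1}^{p}\sum_{i=1}^{l}\tilde\lambda_i Z_{thi}\varepsilon_{th}$, it follows from the
linearity of conditional expectations that

\[
E(\lambda' V^{-1/2} Z_t\varepsilon_t|\mathcal{F}_{t-1}) = \sum_{h=1}^{p}\sum_{i=1}^{l}\tilde\lambda_i E(Z_{thi}\varepsilon_{th}|\mathcal{F}_{t-1}) = 0
\]

given (iii'.a). Hence $\{\lambda' V^{-1/2} Z_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence.
As a consequence of stationarity, $\mathrm{var}(\lambda' V^{-1/2} Z_t\varepsilon_t) = \lambda' V^{-1/2} V V^{-1/2}\lambda =
1$ for all $t$, and for all $t$, $F_{nt} = F$, the distribution function of $\lambda' V^{-1/2} Z_t\varepsilon_t$.
It follows from Exercise 5.9 that the Lindeberg condition is satisfied. Since
$\{\lambda' V^{-1/2} Z_t\varepsilon_t\varepsilon_t'Z_t' V^{-1/2}\lambda\}$ is a stationary and ergodic sequence by Proposition
3.30 with finite expected absolute values given (iii'.b) and (iii'.c), the
ergodic theorem (Theorem 3.34) and Theorem 2.24 imply

\[
\begin{aligned}
& n^{-1}\sum_{t=1}^{n}\lambda' V^{-1/2} Z_t\varepsilon_t\varepsilon_t'Z_t' V^{-1/2}\lambda - \lambda' V^{-1/2} V V^{-1/2}\lambda \\
&\quad = n^{-1}\sum_{t=1}^{n}\lambda' V^{-1/2} Z_t\varepsilon_t\varepsilon_t'Z_t' V^{-1/2}\lambda - 1 \xrightarrow{a.s.} 0 .
\end{aligned}
\]

Hence, by Theorem 5.24, $n^{-1/2}\sum_{t=1}^{n}\lambda' V^{-1/2} Z_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$. It follows
from Proposition 5.1 that $V^{-1/2} n^{-1/2} Z'\varepsilon \stackrel{A}{\sim} N(0, I)$, and since $V = V_n$,
$V_n^{-1/2} n^{-1/2} Z'\varepsilon \stackrel{A}{\sim} N(0, I)$. The rest of the results follow as before. ∎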

Whereas use of the martingale difference assumption allows us to state 
simpler conditions for stationary ergodic processes, it also allows us to state 
weaker conditions on certain aspects of the behavior of mixing processes. 
To do this conveniently, we apply a Liapounov-like corollary to the central 
limit theorem just given. 


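Corollary 5.26 Let $\{Z_{nt}, \mathcal{F}_{nt}\}$ be a martingale difference sequence such
that $E|Z_{nt}|^{2+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $n$ and $t$. If $\bar\sigma_n^2 > \delta' > 0$
for all $n$ sufficiently large and $n^{-1}\sum_{t=1}^{n} Z_{nt}^2 - \bar\sigma_n^2 \xrightarrow{p} 0$, then
$\sqrt{n}\,\bar Z_n/\bar\sigma_n \stackrel{A}{\sim} N(0, 1)$.

Proof. Given $E|Z_{nt}|^{2+\delta} < \Delta < \infty$, the Lindeberg condition holds as
shown in the proof of Theorem 5.10. Since $\bar\sigma_n^2 > \delta' > 0$, $\bar\sigma_n^{-2}$ is $O(1)$, so
$n^{-1}\sum_{t=1}^{n} Z_{nt}^2/\bar\sigma_n^2 - 1 = \bar\sigma_n^{-2}(n^{-1}\sum_{t=1}^{n} Z_{nt}^2 - \bar\sigma_n^2) \xrightarrow{p} 0$ by Exercise 2.35.
The conditions of Theorem 5.24 hold and the result follows. ∎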

We use this result to obtain an analog to Theorem 5.25.


Exercise 5.27 Prove the following. Suppose conditions (i), (ii), (iv), and
(v) of Theorem 5.23 hold, and replace (iii) with
(iii') (a) $E(Z_{thi}\varepsilon_{th}|\mathcal{F}_{t-1}) = 0$ for all $t$, where $\{\mathcal{F}_t\}$ is adapted to $\{Z_{thi}\varepsilon_{th}\}$,
$h = 1, \ldots, p$, $i = 1, \ldots, l$;
(b) $E|Z_{thi}\varepsilon_{th}|^r < \Delta < \infty$ for all $h = 1, \ldots, p$, $i = 1, \ldots, l$, and all $t$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon)$ is uniformly positive definite.


Then the conclusions of Theorem 5.23 hold.


Note that although the assumption (iii.a) has been strengthened from
$E(Z_t\varepsilon_t) = 0$ to the martingale difference assumption, we have maintained
the moment requirements of (iii.b).


References 


Hansen, L. P. (1982). “Large sample properties of generalized method of moments 
estimators.” Econometrica, 50, 1029-1054. 


Heyde, C. C. (1975). “On the central limit theorem and iterated logarithm law 
for stationary processes.” Bulletin of the Australian Mathematical Society, 12, 
1-8. 

Loeve, M. (1977). Probability Theory. Springer-Verlag, New York. 


McLeish, D. L. (1974). “Dependent central limit theorems and invariance princi- 
ples.” Annals of Probability, 2, 620-628. 


Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New 
York. 


Scott, D. J. (1973). “Central limit theorems for martingales and for processes with
stationary increments using a Skorokhod representation approach.” Advances
in Applied Probability, 5, 119-137.


Wooldridge, J. M. (1986). Asymptotic Properties of Econometric Estimators.
Unpublished Ph.D. dissertation. University of California, San Diego.


Wooldridge, J. M. and H. White (1988). “Some invariance principles and central limit
theorems for dependent heterogeneous processes.” Econometric Theory, 4, 210-230.


CHAPTER 6 


Estimating Asymptotic Covariance Matrices 


In all the preceding chapters, we defined $V_n = \mathrm{var}(n^{-1/2}X'\varepsilon)$ or $V_n =
\mathrm{var}(n^{-1/2}Z'\varepsilon)$ and assumed that a consistent estimator $\hat V_n$ for $V_n$ is available.
In this chapter we obtain conditions that allow us to find convenient
consistent estimators $\hat V_n$. Because the theory of estimating $\mathrm{var}(n^{-1/2}X'\varepsilon)$
is identical to that of estimating $\mathrm{var}(n^{-1/2}Z'\varepsilon)$, we consider only the latter.
Further, because the optimal choice for $P_n$ is $V_n^{-1}$, as we saw in Chapter
4, conditions that permit consistent estimation of $V_n$ will also permit consistent
estimation of $P_n = V_n^{-1}$ by $\hat P_n = \hat V_n^{-1}$.


6.1 General Structure of $V_n$


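Before proceeding to look at special cases, it is helpful to examine the
general form of $V_n$. We have

\[
V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon) = E(Z'\varepsilon\varepsilon'Z/n),
\]

because we assume that $E(n^{-1/2}Z'\varepsilon) = 0$. In terms of individual observations,
this can be expressed as

\[
V_n = E\Big(n^{-1}\sum_{t=1}^{n}\sum_{\tau=1}^{n} Z_t\varepsilon_t\varepsilon_\tau'Z_\tau'\Big).
\]

An equivalent way of writing the summation on the right is helpful in
obtaining further insight. We can also write

\[
\begin{aligned}
V_n &= n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t')
 + n^{-1}\sum_{\tau=1}^{n-1}\sum_{t=\tau+1}^{n} E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t') \\
 &= n^{-1}\sum_{t=1}^{n} \mathrm{var}(Z_t\varepsilon_t)
 + n^{-1}\sum_{\tau=1}^{n-1}\sum_{t=\tau+1}^{n} \big[\mathrm{cov}(Z_t\varepsilon_t, Z_{t-\tau}\varepsilon_{t-\tau}) + \mathrm{cov}(Z_{t-\tau}\varepsilon_{t-\tau}, Z_t\varepsilon_t)\big].
\end{aligned}
\]

The last expression reveals that $V_n$ is the average of the variances of
$Z_t\varepsilon_t$ plus a term that takes into account the covariances between $Z_t\varepsilon_t$
and $Z_{t-\tau}\varepsilon_{t-\tau}$ for all $t$ and $\tau$. We consider three important special cases.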
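
Before turning to the cases, the rearrangement of the double sum can be checked numerically. The sketch below is an illustration added here, not part of the original text. It uses arbitrary simulated vectors `g[t]` as stand-ins for $Z_t\varepsilon_t$ and confirms that the "own products plus lagged cross products" form reproduces the full outer product of the sum, before expectations are taken. NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 6, 2
g = rng.normal(size=(n, l))   # row t stands in for (Z_t eps_t)'; values are arbitrary

# Left side: n^{-1} (sum_t g_t)(sum_tau g_tau)'
s = g.sum(axis=0)
full = np.outer(s, s) / n

# Right side: own products plus lagged cross products in both orders
own = sum(np.outer(g[t], g[t]) for t in range(n)) / n
cross = np.zeros((l, l))
for tau in range(1, n):
    for t in range(tau, n):   # zero-based t = tau..n-1 matches t = tau+1..n in the text
        cross += np.outer(g[t], g[t - tau]) + np.outer(g[t - tau], g[t])
cross /= n

print(np.allclose(full, own + cross))
```

Taking expectations of both sides term by term gives the two expressions for $V_n$ above.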

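Case 1. The first case we consider occurs when $\{Z_t\varepsilon_t\}$ is uncorrelated,
so that

\[
\mathrm{cov}(Z_t\varepsilon_t, Z_{t-\tau}\varepsilon_{t-\tau}) = \mathrm{cov}(Z_{t-\tau}\varepsilon_{t-\tau}, Z_t\varepsilon_t)' = 0
\]

for all $\tau \ne 0$, so

\[
V_n = n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t').
\]

This occurs when $\{(Z_t', \varepsilon_t)\}$ is an independent sequence or when $\{Z_t\varepsilon_t, \mathcal{F}_t\}$
is a martingale difference sequence for some adapted $\sigma$-fields $\mathcal{F}_t$.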

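Case 2. The next case we consider occurs when $\{Z_t\varepsilon_t\}$ is “finitely correlated,”
so that

\[
\mathrm{cov}(Z_t\varepsilon_t, Z_{t-\tau}\varepsilon_{t-\tau}) = \mathrm{cov}(Z_{t-\tau}\varepsilon_{t-\tau}, Z_t\varepsilon_t)' = 0
\]

for all $\tau > m$, $1 \le m < \infty$, so

\[
V_n = n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t')
 + n^{-1}\sum_{\tau=1}^{m}\sum_{t=\tau+1}^{n} \big[E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}') + E(Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t')\big].
\]

This case arises when $E(Z_t\varepsilon_t|\mathcal{F}_{t-\tau}) = 0$ for some $\tau$, $1 \le \tau < \infty$, and
adapted $\sigma$-fields $\mathcal{F}_t$. A simple example of this case arises when $Z_t$ is nonstochastic
and $\varepsilon_t$ is a scalar MA(1) process, i.e.,

\[
\varepsilon_t = \alpha v_t + v_{t-1},
\]

where $\{v_t\}$ is an i.i.d. sequence with $E(v_t) = 0$. Setting $\mathcal{F}_t = \sigma(\ldots, \varepsilon_t)$, we
readily verify that $E(Z_t\varepsilon_t|\mathcal{F}_{t-\tau}) = Z_t E(\varepsilon_t|\mathcal{F}_{t-\tau}) = 0$ for $\tau \ge 2$, implying
that

\[
V_n = n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t')
 + n^{-1}\sum_{t=2}^{n} \big[E(Z_t\varepsilon_t\varepsilon_{t-1}'Z_{t-1}') + E(Z_{t-1}\varepsilon_{t-1}\varepsilon_t'Z_t')\big].
\]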

Case 3. The last case that we consider occurs when $\{Z_t\varepsilon_t\}$ is an asymptotically
uncorrelated sequence so that

\[
\mathrm{cov}(Z_t\varepsilon_t, Z_{t-\tau}\varepsilon_{t-\tau}) = \mathrm{cov}(Z_{t-\tau}\varepsilon_{t-\tau}, Z_t\varepsilon_t)' \to 0 \quad \text{as } \tau \to \infty .
\]

Rather than making direct use of the assumption that $\{(Z_t', \varepsilon_t)\}$ is asymptotically
uncorrelated, we shall assume that $\{(Z_t', \varepsilon_t)\}$ is a mixing sequence,
which will suffice for asymptotic uncorrelatedness.

In what follows, we typically assume that nothing more is known about
the correlation structure of $\{Z_t\varepsilon_t\}$ beyond that it falls in one of these
three cases. If additional knowledge were available, then it could be used
to estimate $V_n$; more importantly, however, it could be used to obtain
efficient estimators as discussed in Chapter 4. The analysis of this chapter
is thus relevant in the common situation in which this knowledge is absent.


6.2 Case 1: $\{Z_t\varepsilon_t\}$ Uncorrelated


In this section, we treat the case in which

\[
V_n = n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t').
\]

A special case of major importance arises when

\[
E(\varepsilon_t\varepsilon_t'|Z_t) = \sigma_o^2 I,
\]

so that

\[
V_n = n^{-1}\sum_{t=1}^{n} \sigma_o^2 E(Z_tZ_t') \equiv \sigma_o^2 L_n .
\]

Our first result applies to this case.


Theorem 6.1 Suppose $V_n = \sigma_o^2 L_n$, where $\sigma_o^2 < \infty$ and $L_n$ is $O(1)$. If
there exists $\hat\sigma_n^2$ such that $\hat\sigma_n^2 \xrightarrow{p} \sigma_o^2$ and if $Z'Z/n - L_n \xrightarrow{p} 0$, then $\hat V_n =
\hat\sigma_n^2 Z'Z/n$ is such that $\hat V_n - V_n \xrightarrow{p} 0$.

Proof. Immediate from Proposition 2.30. ∎
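
As a minimal sketch of the estimator in Theorem 6.1 (added here for illustration, and restricted to the single-equation case $p = 1$), the function below forms $\hat\sigma_n^2 Z'Z/n$ from an instrument matrix and a residual vector. The function name and the small data arrays are hypothetical. NumPy is assumed to be available.

```python
import numpy as np

def v_hat_homoskedastic(Z, resid):
    """sigma2_hat * Z'Z / n for scalar errors (p = 1).

    Z is n x l; resid is a length-n vector of estimated errors.
    """
    n = Z.shape[0]
    sigma2_hat = resid @ resid / n      # residual variance estimate with p = 1
    return sigma2_hat * (Z.T @ Z) / n


# Tiny hypothetical data set, chosen so the result is easy to check by hand:
# sigma2_hat = 6/4 = 1.5 and Z'Z/n = 0.75 * I, so V_hat = 1.125 * I.
Z = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
resid = np.array([1.0, -1.0, 2.0, 0.0])
print(v_hat_homoskedastic(Z, resid))
```

The estimator is only appropriate when the conditional variance of the errors does not depend on $Z_t$. Otherwise it need not be consistent for $V_n$.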


Exercise 6.2 Using Exercise 3.79, find conditions that ensure that $\hat\sigma_n^2 \xrightarrow{p}
\sigma_o^2$ and $Z'Z/n - L_n \xrightarrow{p} 0$, where $\hat\sigma_n^2 = (Y - X\tilde\beta_n)'(Y - X\tilde\beta_n)/(np)$.


Conditions under which $\hat\sigma_n^2 \xrightarrow{p} \sigma_o^2$ and $Z'Z/n - L_n \xrightarrow{p} 0$ are easily found
from the results of Chapter 3.

In the remainder of this section we consider the cases in which the sequence
$\{(Z_t', X_t', \varepsilon_t)\}$ is stationary or $\{(Z_t', X_t', \varepsilon_t)\}$ is a heterogeneous sequence.
We invoke the martingale difference assumption in each case, which
allows results for independent observations to follow as direct corollaries.

The results that we obtain next are motivated by the following considerations.
We are interested in estimating

\[
V_n = n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t').
\]

If both $Z_t$ and $\varepsilon_t$ were observable, a consistent estimator is easily available
from the results of Chapter 3, say,

\[
\tilde V_n \equiv n^{-1}\sum_{t=1}^{n} Z_t\varepsilon_t\varepsilon_t'Z_t'.
\]

For example, if $\{(Z_t', \varepsilon_t)\}$ were a stationary ergodic sequence, then as long as
the elements of $Z_t\varepsilon_t\varepsilon_t'Z_t'$ have finite expected absolute value, it follows from
the ergodic theorem that $\tilde V_n - V_n \xrightarrow{a.s.} 0$. Of course, $\varepsilon_t$ is not observable.
However, it can be estimated by

\[
\tilde\varepsilon_t \equiv Y_t - X_t'\tilde\beta_n,
\]

where $\tilde\beta_n$ is consistent for $\beta_o$. This leads us to consider estimators of the
form

\[
\hat V_n = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'.
\]

As we prove next, replacing $\varepsilon_t$ with $\tilde\varepsilon_t$ makes no difference asymptotically
under general conditions, so $\hat V_n - V_n \xrightarrow{p} 0$. These conditions are precisely
specified for stationary sequences by the next result.

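Theorem 6.3 Suppose that


(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;


(iii) (a) $\{Z_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence;
(b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon) = \mathrm{var}(Z_t\varepsilon_t) \equiv V$ is positive definite;


(iv) (a) $E|Z_{thi}X_{thj}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$;
(b) $Q = E(Z_tX_t')$ has full column rank;
(c) $\hat P_n \xrightarrow{p} P$, finite, symmetric, and positive definite.

Then $\hat V_n - V \xrightarrow{p} 0$, and $\hat V_n^{-1} - V^{-1} \xrightarrow{p} 0$.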


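Proof. By definition and assumption (iii.a),

\[
\hat V_n - V = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t' - E(Z_t\varepsilon_t\varepsilon_t'Z_t').
\]

Now $\tilde\varepsilon_t = Y_t - X_t'\tilde\beta_n = Y_t - X_t'\beta_o - X_t'(\tilde\beta_n - \beta_o) = \varepsilon_t - X_t'(\tilde\beta_n - \beta_o)$,
given assumption (i). Substituting for $\tilde\varepsilon_t$, we have

\[
\begin{aligned}
\hat V_n - V &= n^{-1}\sum_{t=1}^{n} Z_t\big(\varepsilon_t - X_t'(\tilde\beta_n - \beta_o)\big)\big(\varepsilon_t - X_t'(\tilde\beta_n - \beta_o)\big)'Z_t'
 - E(Z_t\varepsilon_t\varepsilon_t'Z_t') \\
 &= n^{-1}\sum_{t=1}^{n} Z_t\varepsilon_t\varepsilon_t'Z_t' - E(Z_t\varepsilon_t\varepsilon_t'Z_t') \\
 &\quad - n^{-1}\sum_{t=1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)\varepsilon_t'Z_t' \\
 &\quad - n^{-1}\sum_{t=1}^{n} Z_t\varepsilon_t(\tilde\beta_n - \beta_o)'X_tZ_t' \\
 &\quad + n^{-1}\sum_{t=1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)'X_tZ_t'.
\end{aligned}
\]

Given (ii) and (iii.b), the ergodic theorem ensures that $n^{-1}\sum_{t=1}^{n} Z_t\varepsilon_t\varepsilon_t'Z_t' -
E(Z_t\varepsilon_t\varepsilon_t'Z_t') \xrightarrow{a.s.} 0$, and therefore also vanishes in probability. It suffices
that the remaining terms also vanish in probability, by Exercise 2.35.

To analyze the remaining terms, recall that $\mathrm{vec}(ABC) = (C' \otimes A)\mathrm{vec}(B)$,
and apply this to the second summation with $A$, $B$, and $C$ chosen to give

\[
\begin{aligned}
\mathrm{vec}\Big(n^{-1}\sum_{t=1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)\varepsilon_t'Z_t'\Big)
 &= n^{-1}\sum_{t=1}^{n} \mathrm{vec}\big(Z_tX_t'(\tilde\beta_n - \beta_o)\varepsilon_t'Z_t'\big) \\
 &= n^{-1}\sum_{t=1}^{n} (Z_t\varepsilon_t \otimes Z_tX_t')\,\mathrm{vec}(\tilde\beta_n - \beta_o).
\end{aligned}
\]

The conditions of the theorem ensure that $\tilde\beta_n \xrightarrow{p} \beta_o$ by Exercise 3.38. To
conclude that this term vanishes in probability, it suffices that $n^{-1}\sum_{t=1}^{n}
(Z_t\varepsilon_t \otimes Z_tX_t')$ is $O_p(1)$ by Corollary 2.36. But this is guaranteed by the
ergodic theorem, provided that $E(Z_t\varepsilon_t \otimes Z_tX_t')$ is finite. Conditions (iii.b)
and (iv.a) ensure this, as can be verified by repeated application of the
Cauchy-Schwarz inequality.

Thus,

\[
n^{-1}\sum_{t=1}^{n} (Z_t\varepsilon_t \otimes Z_tX_t')\,\mathrm{vec}(\tilde\beta_n - \beta_o) \xrightarrow{p} 0 .
\]

Similarly, apply $\mathrm{vec}(ABC) = (C' \otimes A)\mathrm{vec}(B)$ to obtain

\[
\mathrm{vec}\Big(n^{-1}\sum_{t=1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)'X_tZ_t'\Big)
 = n^{-1}\sum_{t=1}^{n} (Z_tX_t' \otimes Z_tX_t')\,\mathrm{vec}\big((\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)'\big).
\]

Again we have $\tilde\beta_n - \beta_o \xrightarrow{p} 0$, so this term vanishes provided that
$E(Z_tX_t' \otimes Z_tX_t')$ is finite. This is true given (iv.a), as can be verified by
Cauchy-Schwarz. The desired result $\hat V_n - V \xrightarrow{p} 0$ now follows by Exercise
2.35. As $V$ is positive definite given (iii.c), it follows from Proposition 2.30
that $\hat V_n^{-1} - V^{-1} \xrightarrow{p} 0$. ∎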
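
The step that moves $\tilde\beta_n - \beta_o$ out of the summation rests on the identity $\mathrm{vec}(ABC) = (C' \otimes A)\mathrm{vec}(B)$. The following sketch, added here for illustration, checks the identity numerically for one observation with scalar $\varepsilon_t$. All numerical values are arbitrary draws, and `d` is a hypothetical stand-in for $\tilde\beta_n - \beta_o$. NumPy is assumed to be available.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return np.asarray(M).reshape(-1, order="F")


rng = np.random.default_rng(1)
l, k = 3, 2
Z_t = rng.normal(size=(l, 1))   # instruments for one observation (p = 1)
X_t = rng.normal(size=(k, 1))   # regressors for one observation
eps_t = rng.normal()            # scalar error
d = rng.normal(size=(k, 1))     # stands in for beta_tilde - beta_o

A = Z_t @ X_t.T                 # l x k
B = d                           # k x 1
C = eps_t * Z_t.T               # 1 x l, i.e. eps_t' Z_t'

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)  # (C' kron A) vec(B), with C' = Z_t eps_t

print(np.allclose(lhs, rhs))
```

Here $C' = Z_t\varepsilon_t$, so the Kronecker factor is exactly $Z_t\varepsilon_t \otimes Z_tX_t'$, as in the proof.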

Comparing the conditions of this result with those of Theorem 5.25, we see
that we have strengthened moment condition (iv.a) and that this, together
with the other assumptions, implies assumption (v) of Theorem 5.25. An
immediate corollary of this fact is that the conclusions of Theorem 5.25
hold under the conditions of Theorem 6.3.


Corollary 6.4 Suppose conditions (i)-(iv) of Theorem 6.3 hold. Then

\[
D^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I),
\]

where

\[
D = (Q'PQ)^{-1} Q'PVPQ (Q'PQ)^{-1}.
\]

Further, $\hat D_n - D \xrightarrow{p} 0$, where

\[
\hat D_n = (X'Z\hat P_n Z'X/n^2)^{-1}(X'Z/n)\hat P_n \hat V_n \hat P_n (Z'X/n)(X'Z\hat P_n Z'X/n^2)^{-1}.
\]

Proof. Immediate from Theorem 5.25 and Theorem 6.3. ∎
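
The estimator $\hat V_n = n^{-1}\sum_t Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'$ used in Theorem 6.3 and Corollary 6.4 is simple to compute. The sketch below is an added illustration for the single-equation case $p = 1$, in which $\tilde\varepsilon_t$ is a scalar. The function name and the small data arrays are hypothetical. NumPy is assumed to be available.

```python
import numpy as np

def v_hat_outer_product(Z, resid):
    """n^{-1} * sum_t Z_t e_t^2 Z_t' for scalar residuals (p = 1).

    Z is n x l with row t equal to Z_t'; resid is a length-n residual vector.
    """
    n = Z.shape[0]
    Ze = Z * resid[:, None]     # row t is e_t * Z_t'
    return Ze.T @ Ze / n


# Small hypothetical data set; compare against an explicit loop over t.
Z = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
resid = np.array([1.0, -1.0, 2.0, 0.0])

loop = sum(resid[t] ** 2 * np.outer(Z[t], Z[t]) for t in range(len(resid))) / len(resid)
print(np.allclose(v_hat_outer_product(Z, resid), loop))
```

If every squared residual equals the same constant $c$, the result reduces to $c\,Z'Z/n$, which is the form appearing in Theorem 6.1.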


The usefulness of this result arises in situations in which it is inappropriate
to assume that

\[
E(\varepsilon_t\varepsilon_t'|Z_t) = \sigma_o^2 I,
\]

that is, when the errors $\varepsilon_t$ exhibit heteroskedasticity of unknown form.
The present results thus provide an instrumental variables analog of the
heteroskedasticity-consistent covariance matrix estimator of White (1980).
The results of Theorem 6.3 and Corollary 6.4 suggest a simple two-step
procedure for obtaining the efficient estimator of Proposition 4.45, i.e.,

\[
\beta_n^* = (X'Z\hat V_n^{-1}Z'X)^{-1} X'Z\hat V_n^{-1}Z'Y .
\]

First, one obtains a consistent estimator for $\beta_o$, for example, the 2SLS
estimator,

\[
\tilde\beta_n = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'Y,
\]

and forms

\[
\hat V_n = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t',
\]

where $\tilde\varepsilon_t = Y_t - X_t'\tilde\beta_n$. Second, this estimator is then used to compute
the efficient estimator $\beta_n^*$. Because $\beta_n^*$ can be computed in this way, it is
called the two-stage instrumental variables (2SIV) estimator, introduced
by White (1982). Formally, we have the following result.


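Corollary 6.5 Suppose that
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;

(iii) (a) $\{Z_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence;
(b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon) = \mathrm{var}(Z_t\varepsilon_t) \equiv V$ is positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$, and
$E|Z_{thi}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$;
(b) $Q = E(Z_tX_t')$ has full column rank;
(c) $L = E(Z_tZ_t')$ is positive definite.


Define

\[
\hat V_n = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t',
\]

where $\tilde\varepsilon_t \equiv Y_t - X_t'\tilde\beta_n$, $\tilde\beta_n \equiv (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$, and
define

\[
\beta_n^* \equiv (X'Z\hat V_n^{-1}Z'X)^{-1} X'Z\hat V_n^{-1}Z'Y .
\]

Then $D^{-1/2}\sqrt{n}(\beta_n^* - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

\[
D = (Q'V^{-1}Q)^{-1}.
\]

Further, $\hat D_n - D \xrightarrow{p} 0$, where

\[
\hat D_n = (X'Z\hat V_n^{-1}Z'X/n^2)^{-1}.
\]

Proof. Conditions (i)-(iv) ensure that Theorem 6.3 holds for $\tilde\beta_n$. (Note
that the second part of (iv.a) is redundant if $X_t$ contains a constant.) Next
set $\hat P_n = \hat V_n^{-1}$ in Corollary 6.4. Then $P = V^{-1}$, and the result follows. ∎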
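
The two-step procedure in Corollary 6.5 can be sketched in a few lines. The code below is an added illustration for the single-equation case $p = 1$, not a definitive implementation. The function names `two_sls` and `two_siv` and the simulated data-generating process (one endogenous regressor, three instruments, heteroskedastic errors, true coefficient 2.0) are all assumptions made for the example. NumPy is assumed to be available.

```python
import numpy as np

def two_sls(Y, X, Z):
    """2SLS: (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'Y."""
    XZ = X.T @ Z
    ZZinv_ZX = np.linalg.solve(Z.T @ Z, Z.T @ X)
    ZZinv_ZY = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    return np.linalg.solve(XZ @ ZZinv_ZX, XZ @ ZZinv_ZY)


def two_siv(Y, X, Z):
    """2SIV for scalar errors (p = 1): returns (beta_star, D_hat)."""
    n = Z.shape[0]
    resid = Y - X @ two_sls(Y, X, Z)            # first step: 2SLS residuals
    Ze = Z * resid[:, None]
    V_hat = Ze.T @ Ze / n                       # n^{-1} sum_t Z_t e_t^2 Z_t'
    XZ = X.T @ Z
    Vinv_ZX = np.linalg.solve(V_hat, Z.T @ X)
    Vinv_ZY = np.linalg.solve(V_hat, Z.T @ Y)
    beta_star = np.linalg.solve(XZ @ Vinv_ZX, XZ @ Vinv_ZY)
    D_hat = np.linalg.inv(XZ @ Vinv_ZX / n**2)  # (X'Z V_hat^{-1} Z'X / n^2)^{-1}
    return beta_star, D_hat


# Hypothetical simulated data: x is endogenous through u, and eps is heteroskedastic in Z.
rng = np.random.default_rng(42)
n = 2000
Z = rng.normal(size=(n, 3))
u = rng.normal(size=n)
x = Z @ np.array([1.0, 0.5, -0.5]) + u + rng.normal(size=n)
eps = u * np.sqrt(0.5 + Z[:, 0] ** 2)           # E(Z_t eps_t) = 0 by construction
X = x[:, None]
Y = X @ np.array([2.0]) + eps

beta_star, D_hat = two_siv(Y, X, Z)
print(beta_star, np.sqrt(D_hat[0, 0] / n))      # estimate and an approximate standard error
```

When the number of instruments equals the number of regressors, the weighting matrix drops out and 2SLS, 2SIV, and the simple IV estimator $(Z'X)^{-1}Z'Y$ coincide. The efficiency gain from the second step therefore arises only in the overidentified case.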


This result is the most explicit asymptotic normality result obtained 
so far, because all of the conditions are stated directly in terms of the 
stochastic properties of the instrumental variables, regressors, and errors. 
The remainder of the asymptotic normality results stated in this chapter 
will also share this convenient feature. 

We note that results for the OLS estimator follow as a special case upon
setting $Z_t = X_t$ and that results for the i.i.d. case follow as immediate
corollaries, since an i.i.d. sequence is a stationary ergodic martingale
difference sequence when $E(Z_t\varepsilon_t) = 0$.

Analogous results hold for heterogeneous sequences. Because the proofs 
are completely parallel to those just given, they are left as an exercise for 
the reader. 


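Exercise 6.6 Prove the following result. Suppose
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(2r-1)$,
$r > 1$, or $\alpha$ of size $-r/(r-1)$, $r > 1$;

(iii) (a) $\{Z_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence;
(b) $E|Z_{thi}\varepsilon_{th}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, and $t$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon)$ is uniformly positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, $j = 1, \ldots, k$, and $t$;
(b) $Q_n = E(Z'X/n)$ has uniformly full column rank;
(c) $\hat P_n - P_n \xrightarrow{p} 0$, where $P_n = O(1)$ is symmetric and uniformly
positive definite.


Then $\hat V_n - V_n \xrightarrow{p} 0$ and $\hat V_n^{-1} - V_n^{-1} \xrightarrow{p} 0$.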


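Exercise 6.7 Prove the following result. Suppose conditions (i)-(iv) of Exercise
6.6 hold. Then $D_n^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

\[
D_n = (Q_n'P_nQ_n)^{-1} Q_n'P_nV_nP_nQ_n (Q_n'P_nQ_n)^{-1}.
\]

Further, $\hat D_n - D_n \xrightarrow{p} 0$, where

\[
\hat D_n = (X'Z\hat P_n Z'X/n^2)^{-1}(X'Z/n)\hat P_n \hat V_n \hat P_n (Z'X/n)(X'Z\hat P_n Z'X/n^2)^{-1}.
\]

Exercise 6.8 Prove the following result. Suppose


(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(2r-1)$,
$r > 1$, or $\alpha$ of size $-r/(r-1)$, $r > 1$;

(iii) (a) $\{Z_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence;
(b) $E|Z_{thi}\varepsilon_{th}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$,
$i = 1, \ldots, l$, and $t$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon)$ is uniformly positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^{2(r+\delta)} < \Delta < \infty$ and $E|Z_{thi}|^{2(r+\delta)} < \Delta < \infty$, for
some $\delta > 0$ and all $h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$, and $t$;
(b) $Q_n = E(Z'X/n)$ has uniformly full column rank;
(c) $L_n = E(Z'Z/n)$ is uniformly positive definite.


Define

\[
\hat V_n = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t',
\]

where $\tilde\varepsilon_t = Y_t - X_t'\tilde\beta_n$, $\tilde\beta_n = (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$, and

\[
\beta_n^* = (X'Z\hat V_n^{-1}Z'X)^{-1} X'Z\hat V_n^{-1}Z'Y .
\]

Then $D_n^{-1/2}\sqrt{n}(\beta_n^* - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

\[
D_n = (Q_n'V_n^{-1}Q_n)^{-1}.
\]

Further, $\hat D_n - D_n \xrightarrow{p} 0$, where

\[
\hat D_n = (X'Z\hat V_n^{-1}Z'X/n^2)^{-1}.
\]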


This result allows for unconditional heterogeneity not allowed by
Corollary 6.5, at the expense of imposing somewhat stronger memory and
moment conditions. Results for the independent case follow as corollaries
because independent sequences are $\phi$-mixing sequences for which we can set
$r = 1$. Thus the present result contains the result of White (1982) as a
special case but also allows for the presence of dynamic effects (lagged
dependent or predetermined variables) not permitted there, as well as
applying explicitly to systems of equations or panel data.




6.3 Case 2: $\{Z_t\varepsilon_t\}$ Finitely Correlated


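Here we treat the case in which, for $m < \infty$,
$$\begin{aligned}
V_n &= n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t') \\
&\quad + n^{-1}\sum_{t=2}^{n} \left[E(Z_t\varepsilon_t\varepsilon_{t-1}'Z_{t-1}') + E(Z_{t-1}\varepsilon_{t-1}\varepsilon_t'Z_t')\right] \\
&\quad + \cdots + n^{-1}\sum_{t=m+1}^{n} \left[E(Z_t\varepsilon_t\varepsilon_{t-m}'Z_{t-m}') + E(Z_{t-m}\varepsilon_{t-m}\varepsilon_t'Z_t')\right] \\
&= n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t') \\
&\quad + n^{-1}\sum_{\tau=1}^{m}\sum_{t=\tau+1}^{n} \left[E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}') + E(Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t')\right].
\end{aligned}$$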


Throughout, we shall assume that this structure is generated by a
knowledge that $E(Z_t\varepsilon_t \mid \mathcal{F}_{t-\tau}) = 0$ for $\tau = m + 1 < \infty$ and adapted $\sigma$-fields $\mathcal{F}_t$.
The other conditions imposed and methods of proof will be nearly identical
to those of the preceding section.

First we consider the estimator 


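$$\hat V_n = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\tilde\varepsilon_{t-\tau}\tilde\varepsilon_t'Z_t'\right].$$

It turns out that $\hat V_n - V_n \xrightarrow{p} 0$ under general conditions, as we now
demonstrate.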


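Theorem 6.9 Suppose that
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;
(iii) (a) $E(Z_t\varepsilon_t \mid \mathcal{F}_{t-\tau}) = 0$ for $\tau = m + 1 < \infty$ and adapted $\sigma$-fields $\mathcal{F}_t$;
(b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1,\ldots,p$, $i = 1,\ldots,l$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon) = V$ is positive definite;
(iv) (a) $E|Z_{thi}X_{thj}|^2 < \infty$, $h = 1,\ldots,p$, $i = 1,\ldots,l$, $j = 1,\ldots,k$;
(b) $Q = E(Z_tX_t')$ has full column rank;
(c) $\hat P_n \xrightarrow{p} P$, finite, symmetric, and positive definite.
Then $\hat V_n - V \xrightarrow{p} 0$ and $\hat V_n^{-1} - V^{-1} \xrightarrow{p} 0$.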


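Proof. By definition and assumption (iii.a),
$$\begin{aligned}
\hat V_n - V &= n^{-1}\sum_{t=1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t' - E(Z_t\varepsilon_t\varepsilon_t'Z_t')\right] \\
&\quad + n^{-1}\sum_{\tau=1}^{m}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' - E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}')\right. \\
&\qquad\qquad \left. + Z_{t-\tau}\tilde\varepsilon_{t-\tau}\tilde\varepsilon_t'Z_t' - E(Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t')\right].
\end{aligned}$$
If we can show that
$$n^{-1}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' - E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}')\right] \xrightarrow{p} 0, \quad \tau = 0,\ldots,m,$$
then the desired result follows by Exercise 2.35.
Now
$$\begin{aligned}
& n^{-1}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' - E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}')\right] \\
&\quad = ((n-\tau)/n)\,(n-\tau)^{-1}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' - E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}')\right].
\end{aligned}$$
For $\tau = 0,\ldots,m$, we have $((n-\tau)/n) \to 1$ as $n \to \infty$, so it suffices to
show that for $\tau = 0,\ldots,m$,
$$(n-\tau)^{-1}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' - E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}')\right] \xrightarrow{p} 0.$$

As before, replace $\tilde\varepsilon_t$ with $\tilde\varepsilon_t = \varepsilon_t - X_t'(\tilde\beta_n - \beta_o)$, to obtain
$$\begin{aligned}
& (n-\tau)^{-1}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' - E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}')\right] \\
&\quad = (n-\tau)^{-1}\sum_{t=\tau+1}^{n} \left[Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}' - E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}')\right] \\
&\qquad - (n-\tau)^{-1}\sum_{t=\tau+1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)\varepsilon_{t-\tau}'Z_{t-\tau}' \\
&\qquad - (n-\tau)^{-1}\sum_{t=\tau+1}^{n} Z_t\varepsilon_t(\tilde\beta_n - \beta_o)'X_{t-\tau}Z_{t-\tau}' \\
&\qquad + (n-\tau)^{-1}\sum_{t=\tau+1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)'X_{t-\tau}Z_{t-\tau}'.
\end{aligned}$$
The desired result follows by Exercise 2.35 if each of the terms above
vanishes in probability.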

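The first term vanishes almost surely (in probability) by the ergodic
theorem given (ii) and (iii.b). For the second, use $\mathrm{vec}(ABC) = (C' \otimes A)\mathrm{vec}(B)$
to write
$$\begin{aligned}
& \mathrm{vec}\left((n-\tau)^{-1}\sum_{t=\tau+1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)\varepsilon_{t-\tau}'Z_{t-\tau}'\right) \\
&\quad = (n-\tau)^{-1}\sum_{t=\tau+1}^{n} (Z_{t-\tau}\varepsilon_{t-\tau} \otimes Z_tX_t')\,\mathrm{vec}(\tilde\beta_n - \beta_o).
\end{aligned}$$
The conditions of the theorem ensure $\tilde\beta_n \xrightarrow{p} \beta_o$ by Exercise 3.38, so it
suffices that $(n-\tau)^{-1}\sum_{t=\tau+1}^{n} (Z_{t-\tau}\varepsilon_{t-\tau} \otimes Z_tX_t')$ is $O_p(1)$ by Corollary
2.36. But this follows by the ergodic theorem, given (iii.b) and (iv.a) upon
application of the Cauchy-Schwarz inequality.

Similar argument establishes that
$$\begin{aligned}
& \mathrm{vec}\left((n-\tau)^{-1}\sum_{t=\tau+1}^{n} Z_t\varepsilon_t(\tilde\beta_n - \beta_o)'X_{t-\tau}Z_{t-\tau}'\right) \\
&\quad = (n-\tau)^{-1}\sum_{t=\tau+1}^{n} (Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t)\,\mathrm{vec}((\tilde\beta_n - \beta_o)') \xrightarrow{p} 0.
\end{aligned}$$

For the final term, use $\mathrm{vec}(ABC) = (C' \otimes A)\mathrm{vec}(B)$ to write
$$\begin{aligned}
& \mathrm{vec}\left((n-\tau)^{-1}\sum_{t=\tau+1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)'X_{t-\tau}Z_{t-\tau}'\right) \\
&\quad = (n-\tau)^{-1}\sum_{t=\tau+1}^{n} (Z_{t-\tau}X_{t-\tau}' \otimes Z_tX_t')\,\mathrm{vec}((\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)').
\end{aligned}$$
The conditions of the theorem ensure $\tilde\beta_n \xrightarrow{p} \beta_o$, while conditions (ii) and
(iv.a) ensure that the ergodic theorem applies, which implies
$$(n-\tau)^{-1}\sum_{t=\tau+1}^{n} (Z_{t-\tau}X_{t-\tau}' \otimes Z_tX_t') = O_p(1).$$
The final term thus vanishes by Corollary 2.36 and the proof is complete.
∎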


Results analogous to Corollaries 6.4 and 6.5 also follow similarly. 


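Corollary 6.10 Suppose conditions (i)-(iv) of Theorem 6.9 hold. Then
$D^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \overset{A}{\sim} N(0, I)$, where
$$D = (Q'PQ)^{-1}Q'PVPQ(Q'PQ)^{-1}.$$
Further, $\hat D_n - D \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat P_nZ'X/n^2)^{-1}(X'Z/n)\hat P_n\hat V_n\hat P_n(Z'X/n)(X'Z\hat P_nZ'X/n^2)^{-1}.$$
Proof. Immediate from Exercise 5.19 and Theorem 6.9. ∎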


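Corollary 6.11 Suppose that
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;

(iii) (a) $E(Z_t\varepsilon_t \mid \mathcal{F}_{t-\tau}) = 0$ for $\tau = m + 1 < \infty$ and adapted $\sigma$-fields $\mathcal{F}_t$;
(b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1,\ldots,p$, $i = 1,\ldots,l$;
(c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon) = V$ is positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^2 < \infty$, $h = 1,\ldots,p$, $i = 1,\ldots,l$, $j = 1,\ldots,k$, and
$E|Z_{thi}|^2 < \infty$, $h = 1,\ldots,p$, $i = 1,\ldots,l$;
(b) $Q = E(Z_tX_t')$ has full column rank;
(c) $L = E(Z_tZ_t')$ is positive definite.

Define
$$\hat V_n = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\tilde\varepsilon_{t-\tau}\tilde\varepsilon_t'Z_t'\right],$$
where $\tilde\varepsilon_t = Y_t - X_t'\tilde\beta_n$, $\tilde\beta_n = (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$, and
define
$$\beta_n^* = (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y.$$
Then $D^{-1/2}\sqrt{n}(\beta_n^* - \beta_o) \overset{A}{\sim} N(0, I)$, where
$$D = (Q'V^{-1}Q)^{-1}.$$
Further, $\hat D_n - D \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat V_n^{-1}Z'X/n^2)^{-1}.$$
Proof. Conditions (i)-(iv) ensure that Theorem 6.9 holds for $\tilde\beta_n$. Set
$\hat P_n = \hat V_n^{-1}$ in Corollary 6.10. Then $P = V^{-1}$ and the result follows. ∎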


Results for mixing sequences parallel those of Exercises 6.6-6.8. 


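Exercise 6.12 Prove the following result. Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(2r-2)$,
$r > 2$, or $\alpha$ of size $-r/(r-2)$, $r > 2$;
(iii) (a) $E(Z_t\varepsilon_t \mid \mathcal{F}_{t-\tau}) = 0$ for $\tau = m + 1 < \infty$ and adapted $\sigma$-fields $\mathcal{F}_t$,
$t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^{r+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1,\ldots,p$,
$i = 1,\ldots,l$, and $t = 1, 2, \ldots$;
(c) $V_n = \mathrm{var}(n^{-1/2}\sum_{t=1}^{n} Z_t\varepsilon_t)$ is uniformly positive definite;
(iv) (a) $E|Z_{thi}X_{thj}|^{r+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1,\ldots,p$,
$i = 1,\ldots,l$, $j = 1,\ldots,k$, and $t = 1, 2, \ldots$;
(b) $Q_n = E(Z'X/n)$ has uniformly full column rank;
(c) $\hat P_n - P_n \xrightarrow{p} 0$, where $P_n = O(1)$ is symmetric and uniformly
positive definite.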




Then $\hat V_n - V_n \xrightarrow{p} 0$ and $\hat V_n^{-1} - V_n^{-1} \xrightarrow{p} 0$.

(Hint: be careful to handle $r$ properly to ensure the proper mixing size
requirements.)

Exercise 6.13 Prove the following result. Suppose conditions (i)-(iv) of
Exercise 6.12 hold. Then $D_n^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \overset{A}{\sim} N(0, I)$, where
$$D_n = (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}.$$
Further, $\hat D_n - D_n \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat P_nZ'X/n^2)^{-1}(X'Z/n)\hat P_n\hat V_n\hat P_n(Z'X/n)(X'Z\hat P_nZ'X/n^2)^{-1}.$$
(Hint: apply Theorem 5.23.)


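Exercise 6.14 Prove the following result. Suppose
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(2r-2)$,
$r > 2$, or $\alpha$ of size $-r/(r-2)$, $r > 2$;

(iii) (a) $E(Z_t\varepsilon_t \mid \mathcal{F}_{t-\tau}) = 0$ for $\tau = m + 1 < \infty$ and adapted $\sigma$-fields $\mathcal{F}_t$,
$t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^{r+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1,\ldots,p$,
$i = 1,\ldots,l$, and $t = 1, 2, \ldots$;
(c) $V_n = \mathrm{var}(n^{-1/2}\sum_{t=1}^{n} Z_t\varepsilon_t)$ is uniformly positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^{r+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1,\ldots,p$,
$i = 1,\ldots,l$, $j = 1,\ldots,k$, and $t = 1, 2, \ldots$;
(b) $Q_n = E(Z'X/n)$ has uniformly full column rank;
(c) $L_n = E(Z'Z/n)$ is uniformly positive definite.


Define
$$\hat V_n = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\tilde\varepsilon_{t-\tau}\tilde\varepsilon_t'Z_t'\right],$$
where $\tilde\varepsilon_t = Y_t - X_t'\tilde\beta_n$, $\tilde\beta_n = (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$, and
define
$$\beta_n^* = (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y.$$
Then $D_n^{-1/2}\sqrt{n}(\beta_n^* - \beta_o) \overset{A}{\sim} N(0, I)$, where
$$D_n = (Q_n'V_n^{-1}Q_n)^{-1}.$$
Further, $\hat D_n - D_n \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat V_n^{-1}Z'X/n^2)^{-1}.$$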


Because $\hat V_n - V_n \xrightarrow{p} 0$ and $V_n$ is assumed uniformly positive definite,
it follows that $\hat V_n$ is positive definite with probability approaching one as
$n \to \infty$. In fact, as the proofs of the theorems demonstrate, the convergence
is almost sure (provided $\hat P_n - P_n \xrightarrow{a.s.} 0$), so $\hat V_n$ will then be positive definite
for all $n$ sufficiently large almost surely.

Nevertheless, for given $n$ and particular samples, $\hat V_n$ as just analyzed can
fail to be positive definite. In fact, not only can it be positive semidefinite, 
but it can be indefinite. This is clearly inconvenient, as negative variance 
estimates can lead to results that are utterly useless for testing hypotheses 
or for any other purpose where variance estimates are required. 
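
A scalar sketch shows how this can happen. Here $u_t$ stands in for $Z_t\tilde\varepsilon_t$ with $p = l = 1$, so the two cross terms at each lag coincide and the unweighted estimator reduces to $n^{-1}[\sum_t u_t^2 + 2\sum_{\tau=1}^{m}\sum_{t=\tau+1}^{n} u_tu_{t-\tau}]$. The function name and the alternating series are illustrative assumptions.

```python
def lag_covariance_sum(u, m):
    """Scalar analogue of the unweighted estimator with m lags."""
    n = len(u)
    total = sum(ut * ut for ut in u)
    for tau in range(1, m + 1):
        # In the scalar case the two cross terms coincide, hence the factor 2.
        total += 2.0 * sum(u[t] * u[t - tau] for t in range(tau, n))
    return total / n


# A strongly negatively autocorrelated sample (hypothetical data).
u = [1.0, -1.0, 1.0, -1.0]
v_hat = lag_covariance_sum(u, m=1)
print(v_hat)  # (4 + 2 * (-3)) / 4 = -0.5, a negative "variance" estimate
```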

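What can be done? A simple but effective strategy is to consider the
following weighted version of $\hat V_n$,
$$\tilde V_n = n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\tilde\varepsilon_{t-\tau}\tilde\varepsilon_t'Z_t'\right].$$
The estimator $\hat V_n$ obtains when $w_{n\tau} = 1$ for all $n$ and $\tau$. Moreover,
consistency of $\tilde V_n$ is straightforward to ensure, as the following exercise asks
you to verify.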


Exercise 6.15 Show that under the conditions of Theorem 6.9 or Exercise
6.12 a sufficient condition to ensure $\tilde V_n - V_n \xrightarrow{p} 0$ is that $w_{n\tau} \xrightarrow{p} 1$ for
each $\tau = 1,\ldots,m$.


Subject to this requirement, we can then manipulate $w_{n\tau}$ to ensure the
positive definiteness of $\tilde V_n$ in finite samples. One natural way to proceed is
to let the properties of $\hat V_n$ suggest how to choose $w_{n\tau}$. In particular, if $\hat V_n$
is positive definite (as evidenced by having all positive eigenvalues), then
set $w_{n\tau} = 1$. (As is trivially verified, this ensures $w_{n\tau} \xrightarrow{p} 1$.) Otherwise,
choose $w_{n\tau}$ to enforce positive definiteness in some principled way. For
example, choose $w_{n\tau} = f(\tau, \theta_n)$ and adjust $\theta_n$ to the extent needed to
achieve positive definite $\tilde V_n$. For example, one might choose $w_{n\tau} = 1 - \theta_n\tau$
($\theta_n > 0$), $w_{n\tau} = (\theta_n)^{\tau}$ ($\theta_n < 1$), or $w_{n\tau} = \exp(-\theta_n\tau)$ ($\theta_n > 0$). More
sophisticated methods are available, but as a practical matter, this simple
approach will often suffice.
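
The adjustment just described can be sketched in the scalar case with the linear weights $w_{n\tau} = 1 - \theta_n\tau$. The grid search over $\theta_n$, the step size, and the function names below are assumptions made for illustration; the text does not prescribe how $\theta_n$ should be adjusted.

```python
def weighted_lag_sum(u, weights):
    """Scalar analogue of the weighted estimator; weights[tau - 1] is w_tau."""
    n = len(u)
    total = sum(ut * ut for ut in u)
    for tau, w in enumerate(weights, start=1):
        total += 2.0 * w * sum(u[t] * u[t - tau] for t in range(tau, n))
    return total / n


def shrink_until_positive(u, m, step=0.05):
    """Return (theta, estimate) with w_tau = 1 - theta * tau.

    theta = 0 reproduces the unweighted estimator; theta is increased on a
    grid only as far as needed for a strictly positive estimate.
    """
    max_k = int(round(1.0 / (m * step)))
    for k in range(max_k + 1):
        theta = k * step
        weights = [1.0 - theta * tau for tau in range(1, m + 1)]
        est = weighted_lag_sum(u, weights)
        if est > 0.0:
            return theta, est
    raise ValueError("no positive estimate found on the grid")


u = [1.0, -1.0, 1.0, -1.0]   # hypothetical data; unweighted estimate is -0.5
theta, est = shrink_until_positive(u, m=1)
print(theta, est)
```

For this series the estimate equals $(-2 + 6\theta)/4$, so the search stops at the first grid point above $1/3$.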


6.4 Case 3: $\{Z_t\varepsilon_t\}$ Asymptotically Uncorrelated


In this section we consider the general case in which
$$V_n = n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t')
+ n^{-1}\sum_{\tau=1}^{n-1}\sum_{t=\tau+1}^{n} \left[E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}') + E(Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t')\right].$$

The essential restriction we impose is that as $\tau \to \infty$ the covariance
(correlation) between $Z_t\varepsilon_t$ and $Z_{t-\tau}\varepsilon_{t-\tau}$ goes to zero. We ensure this behavior by
assuming that $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence. In the stationary case,
we replace ergodicity with mixing, which, as we saw in Chapter 3, implies
ergodicity. Mixing is not the weakest possible requirement. Mixingale
conditions can also be used in the present context. Nevertheless, to keep our
analysis tractable we restrict attention to mixing processes.

The fact that mixing sequences are asymptotically uncorrelated is a con- 
sequence of the following lemma. 


Lemma 6.16 Let $Z$ be a random variable measurable with respect to $\mathcal{F}_{n+\tau}^{\infty}$,
$0 \le \tau < \infty$, such that $\|Z\|_q = [E|Z|^q]^{1/q} < \infty$ for some $q > 1$, and let
$1 \le r \le q$. Then
$$\|E(Z \mid \mathcal{F}_{-\infty}^{n}) - E(Z)\|_r \le 2\phi(\tau)^{1-1/q}\,\|Z\|_q$$
and
$$\|E(Z \mid \mathcal{F}_{-\infty}^{n}) - E(Z)\|_r \le 2(2^{1/r} + 1)\,\alpha(\tau)^{1/r-1/q}\,\|Z\|_q.$$

Proof. This follows immediately from Lemma 2.1 of McLeish (1975). ∎


For mixing sequences, $\phi(\tau)$ or $\alpha(\tau)$ goes to zero as $\tau \to \infty$, so this result
imposes bounds on the rate that the conditional expectation of $Z$, given
the past up to period $n$, converges to the unconditional expectation as the
time separation $\tau$ gets larger and larger.

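By setting $r = 2$, we obtain the following result.

Corollary 6.17 Let $E(Z_n) = E(Z_{n+\tau}) = 0$ and suppose $\mathrm{var}(Z_n) < \infty$,
and for some $q > 2$, $E|Z_{n+\tau}|^q < \infty$, $0 \le \tau < \infty$. Then
$$|E(Z_nZ_{n+\tau})| \le 2\phi(\tau)^{1-1/q}\,(\mathrm{var}(Z_n))^{1/2}\,\|Z_{n+\tau}\|_q$$
and
$$|E(Z_nZ_{n+\tau})| \le 2(2^{1/2} + 1)\,\alpha(\tau)^{1/2-1/q}\,(\mathrm{var}(Z_n))^{1/2}\,\|Z_{n+\tau}\|_q.$$

Proof. Put $\mathcal{F}_{-\infty}^{n} = \sigma(\ldots, Z_n)$ and $\mathcal{F}_{n+\tau}^{\infty} = \sigma(Z_{n+\tau}, \ldots)$. By the law of
iterated expectations,
$$E(Z_nZ_{n+\tau}) = E(E(Z_nZ_{n+\tau} \mid \mathcal{F}_{-\infty}^{n})) = E(Z_nE(Z_{n+\tau} \mid \mathcal{F}_{-\infty}^{n}))$$
by Proposition 3.65. It follows from the Cauchy-Schwarz inequality and
Jensen's inequality that
$$|E(Z_nZ_{n+\tau})| \le E(Z_n^2)^{1/2}\,E(E(Z_{n+\tau} \mid \mathcal{F}_{-\infty}^{n})^2)^{1/2}.$$
By Lemma 6.16, we have
$$E(E(Z_{n+\tau} \mid \mathcal{F}_{-\infty}^{n})^2)^{1/2} \le 2\phi(\tau)^{1-1/q}\,\|Z_{n+\tau}\|_q$$
and
$$E(E(Z_{n+\tau} \mid \mathcal{F}_{-\infty}^{n})^2)^{1/2} \le 2(2^{1/2} + 1)\,\alpha(\tau)^{1/2-1/q}\,\|Z_{n+\tau}\|_q,$$
where we set $r = 2$ in Lemma 6.16. Combining these inequalities yields the
final results,
$$|E(Z_nZ_{n+\tau})| \le 2\phi(\tau)^{1-1/q}\,(\mathrm{var}(Z_n))^{1/2}\,\|Z_{n+\tau}\|_q$$
and
$$|E(Z_nZ_{n+\tau})| \le 2(2^{1/2} + 1)\,\alpha(\tau)^{1/2-1/q}\,(\mathrm{var}(Z_n))^{1/2}\,\|Z_{n+\tau}\|_q. \;∎$$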


The direct implication of this result is that mixing sequences are
asymptotically uncorrelated, because $\phi(\tau) \to 0$ ($q > 2$) or $\alpha(\tau) \to 0$ ($q > 2$)
implies $|E(Z_nZ_{n+\tau})| \to 0$ as $\tau \to \infty$. For mixing sequences, it follows that
$V_n$ might be well approximated by
$$\bar V_n = w_{n0}\,n^{-1}\sum_{t=1}^{n} E(Z_t\varepsilon_t\varepsilon_t'Z_t')
+ n^{-1}\sum_{\tau=1}^{m} w_{n\tau}\sum_{t=\tau+1}^{n} \left[E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}') + E(Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t')\right]$$
for some value $m$, because the neglected terms (those with $m < \tau < n$) will
be small in absolute value if $m$ is sufficiently large. Note, however, that if
$m$ is simply kept fixed as $n$ grows, the number of neglected terms grows,
and may grow in such a way that the sum of the neglected terms does not
remain negligible. This suggests that $m$ will have to grow with $n$, so that
the terms in $V_n$ ignored by $\bar V_n$ remain negligible. We will write $m_n$ to
make this dependence explicit.
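
To see why a fixed $m$ can fail, consider a hedged scalar illustration that is not taken from the text: let $u_t = Z_t\varepsilon_t$ be a stationary AR(1) process with coefficient $\rho$ and unit innovation variance, so that $E(u_tu_{t-\tau}) = \rho^{\tau}/(1-\rho^2)$. The sketch below computes the part of $V_n$ omitted when the lag sum is truncated at $m$ with unit weights. The function name and the rule `int(n ** 0.2)` for a growing truncation lag are assumptions.

```python
def neglected_terms(n, m, rho):
    """Sum of the lag terms of V_n beyond lag m for a stationary scalar AR(1)
    with coefficient rho and unit innovation variance."""
    gamma0 = 1.0 / (1.0 - rho * rho)
    return sum(2.0 * (n - tau) / n * gamma0 * rho ** tau
               for tau in range(m + 1, n))


for n in (100, 1000, 10000):
    fixed = neglected_terms(n, 2, 0.5)                  # m held fixed at 2
    growing = neglected_terms(n, int(n ** 0.2), 0.5)    # m growing slowly with n
    print(n, round(fixed, 4), round(growing, 4))
```

With $m$ fixed, the omitted part settles near $2\rho^{m+1}/((1-\rho)(1-\rho^2))$ rather than vanishing, while letting $m$ grow with $n$ drives it toward zero.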

Note that in $\bar V_n$ we have introduced weights $w_{n\tau}$ analogous to those
appearing in $\tilde V_n$ of the previous section. This facilitates our analysis of an
extension of that estimator, which we now define as
$$\hat V_n = w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\tilde\varepsilon_{t-\tau}\tilde\varepsilon_t'Z_t'\right].$$

We establish the consistency of $\hat V_n$ for $V_n$ by first showing that under
suitable conditions $\bar V_n - V_n \to 0$ and then showing that $\hat V_n - \bar V_n \xrightarrow{p} 0$.

Now, however, it will not be enough for consistency just to require
$w_{n\tau} \to 1$, as we also have to properly treat $m_n \to \infty$. For simplicity,
we consider only nonstochastic weights $w_{n\tau}$ in what follows. Stochastic
weights can be treated straightforwardly in a similar manner. Our next
result provides conditions under which $\bar V_n - V_n \to 0$.


Lemma 6.18 Let $\{Z_{nt}\}$ be a double array of random $k \times 1$ vectors such
that $E(|Z_{nt}'Z_{nt}|^{r/2}) < \Delta < \infty$ for some $r > 2$, $E(Z_{nt}) = 0$, $n, t = 1, 2, \ldots$,
and $\{Z_{nt}\}$ is mixing with $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$.
Define
$$V_n = \mathrm{var}\left(n^{-1/2}\sum_{t=1}^{n} Z_{nt}\right),$$
and for any sequence $\{m_n\}$ of integers and any triangular array $\{w_{n\tau} : n = 1, 2, \ldots;\ \tau = 1,\ldots,m_n\}$ define
$$\bar V_n = w_{n0}\,n^{-1}\sum_{t=1}^{n} E(Z_{nt}Z_{nt}')
+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[E(Z_{nt}Z_{n,t-\tau}') + E(Z_{n,t-\tau}Z_{nt}')\right].$$
If $m_n \to \infty$ as $n \to \infty$, if $|w_{n\tau}| < \Delta$, $n = 1, 2, \ldots$, $\tau = 1,\ldots,m_n$, and if
for each $\tau$, $w_{n\tau} \to 1$ as $n \to \infty$, then $\bar V_n - V_n \to 0$.


Proof. See Gallant and White (1988, Lemma 6.6). ∎

Here we see the explicit requirements that $m_n \to \infty$ and $w_{n\tau} \to 1$ for each
$\tau$.

Our next result is an intermediate lemma related to Lemma 6.19 of White 
(1984). That result, however, contained an error, as pointed out by Phillips 
(1985) and Newey and West (1987), that resulted in an incorrect rate for 
$m_n$. The following result gives a correct rate.


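Lemma 6.19 Let $\{Z_{nt}\}$ be a double array of random $k \times 1$ vectors such
that $E(|Z_{nt}'Z_{nt}|^{r}) < \Delta < \infty$ for some $r > 2$, $E(Z_{nt}) = 0$, $n, t = 1, 2, \ldots$,
and $\{Z_{nt}\}$ is mixing with $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$.
Define
$$\zeta_{nt\tau ij} = Z_{nti}Z_{n,t-\tau,j} - E(Z_{nti}Z_{n,t-\tau,j}).$$
If $m_n = o(n^{1/4})$ and $|w_{n\tau}| < \Delta$, $n = 1, 2, \ldots$, $\tau = 1,\ldots,m_n$, then for all
$i, j = 1,\ldots,k$,
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \zeta_{nt\tau ij} \xrightarrow{p} 0.$$

Proof. See Gallant and White (1988, Lemma 6.7 (d)). ∎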


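We now have the results we need to establish the consistency of a
generally useful estimator for $\mathrm{var}(n^{-1/2}\sum_{t=1}^{n} Z_{nt})$.


Theorem 6.20 Let $\{Z_{nt}\}$ be a double array of random $k \times 1$ vectors such
that $E(|Z_{nt}'Z_{nt}|^{r}) < \Delta < \infty$ for some $r > 2$, $E(Z_{nt}) = 0$, $n, t = 1, 2, \ldots$,
and $\{Z_{nt}\}$ is mixing with $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$.
Define
$$V_n = \mathrm{var}\left(n^{-1/2}\sum_{t=1}^{n} Z_{nt}\right),$$
and for any sequence $\{m_n\}$ of integers and any triangular array $\{w_{n\tau} : n = 1, 2, \ldots;\ \tau = 1,\ldots,m_n\}$ define
$$\hat V_n = w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_{nt}Z_{nt}'
+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_{nt}Z_{n,t-\tau}' + Z_{n,t-\tau}Z_{nt}'\right].$$
If $m_n \to \infty$ as $n \to \infty$, $m_n = o(n^{1/4})$, and if $|w_{n\tau}| < \Delta$, $n = 1, 2, \ldots$,
$\tau = 1,\ldots,m_n$, and if for each $\tau$, $w_{n\tau} \to 1$ as $n \to \infty$, then $\hat V_n - V_n \xrightarrow{p} 0$.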


Proof. For $\bar V_n$ as defined in Lemma 6.18 we have
$$\hat V_n - V_n = (\hat V_n - \bar V_n) + (\bar V_n - V_n).$$
By Lemma 6.18 it follows that $\bar V_n - V_n \to 0$, so the result holds if
$\hat V_n - \bar V_n \xrightarrow{p} 0$.
Now
$$\begin{aligned}
\hat V_n - \bar V_n &= w_{n0}\,n^{-1}\sum_{t=1}^{n} \left[Z_{nt}Z_{nt}' - E(Z_{nt}Z_{nt}')\right] \\
&\quad + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_{nt}Z_{n,t-\tau}' - E(Z_{nt}Z_{n,t-\tau}')\right] \\
&\quad + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_{n,t-\tau}Z_{nt}' - E(Z_{n,t-\tau}Z_{nt}')\right].
\end{aligned}$$
The first term converges almost surely to zero by the mixing law of large
numbers Corollary 3.48, while the next term, which has $i,j$ element
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_{nti}Z_{n,t-\tau,j} - E(Z_{nti}Z_{n,t-\tau,j})\right],$$
converges in probability to zero by Lemma 6.19. The final term is the
transpose of the preceding term so it too vanishes in probability, and the
result follows. ∎

With this result, we can establish the consistency of $\hat V_n$ for the
instrumental variables estimator by setting $Z_{nt} = Z_t\varepsilon_t$ and handling the
additional terms that arise from the presence of $\tilde\varepsilon_t$ in place of $\varepsilon_t$ with an
application of Lemma 6.19.

Because the results for stationary mixing sequences follow as corollaries
to the results for general mixing sequences, we state only the results for
general mixing sequences.


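Theorem 6.21 Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(r-1)$ or
$\alpha$ of size $-2r/(r-2)$, $r > 2$;
(iii) (a) $E(Z_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$, $h = 1,\ldots,p$,
$i = 1,\ldots,l$, and all $t$;
(c) $V_n = \mathrm{var}(n^{-1/2}\sum_{t=1}^{n} Z_t\varepsilon_t)$ is uniformly positive definite;
(iv) (a) $E|Z_{thi}X_{thj}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1,\ldots,p$,
$i = 1,\ldots,l$, $j = 1,\ldots,k$, and $t = 1, 2, \ldots$;
(b) $Q_n = E(Z'X/n)$ has uniformly full column rank;
(c) $\hat P_n - P_n \xrightarrow{p} 0$, where $P_n = O(1)$ and is symmetric and uniformly
positive definite.


Define $\hat V_n$ as above. If $m_n \to \infty$ as $n \to \infty$ such that $m_n = o(n^{1/4})$, and
if $|w_{n\tau}| < \Delta$, $n = 1, 2, \ldots$, $\tau = 1,\ldots,m_n$ such that $w_{n\tau} \to 1$ as $n \to \infty$
for each $\tau$, then $\hat V_n - V_n \xrightarrow{p} 0$ and $\hat V_n^{-1} - V_n^{-1} \xrightarrow{p} 0$.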


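Proof. By definition
$$\hat V_n = w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\tilde\varepsilon_{t-\tau}\tilde\varepsilon_t'Z_t'\right].$$
Substituting $\tilde\varepsilon_t = \varepsilon_t - X_t'(\tilde\beta_n - \beta_o)$ gives
$$\hat V_n = w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_t\varepsilon_t\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t'\right] + A_n,$$
where
$$\begin{aligned}
A_n &= -w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_t\varepsilon_t(\tilde\beta_n - \beta_o)'X_tZ_t' \\
&\quad - w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)\varepsilon_t'Z_t' \\
&\quad + w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)'X_tZ_t' \\
&\quad - n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} Z_t\varepsilon_t(\tilde\beta_n - \beta_o)'X_{t-\tau}Z_{t-\tau}' \\
&\quad - n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)\varepsilon_{t-\tau}'Z_{t-\tau}' \\
&\quad + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} Z_tX_t'(\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)'X_{t-\tau}Z_{t-\tau}' \\
&\quad - n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} Z_{t-\tau}\varepsilon_{t-\tau}(\tilde\beta_n - \beta_o)'X_tZ_t' \\
&\quad - n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} Z_{t-\tau}X_{t-\tau}'(\tilde\beta_n - \beta_o)\varepsilon_t'Z_t' \\
&\quad + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} Z_{t-\tau}X_{t-\tau}'(\tilde\beta_n - \beta_o)(\tilde\beta_n - \beta_o)'X_tZ_t'.
\end{aligned}$$

The conditions of the theorem ensure that the conditions of Theorem
6.20 hold for $Z_{nt} = Z_t\varepsilon_t$, so that
$$w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_t\varepsilon_t\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t'\right] - V_n \xrightarrow{p} 0.$$
The desired result then follows provided that all remaining terms (in $A_n$)
vanish in probability.

The mixing and moment conditions imposed are more than enough to
ensure that the terms involving $w_{n0}$ vanish in probability, by arguments
identical to those used in the proof of Exercise 6.6. Thus, consider the term
$$\begin{aligned}
& \mathrm{vec}\left(n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} Z_t\varepsilon_t(\tilde\beta_n - \beta_o)'X_{t-\tau}Z_{t-\tau}'\right) \\
&\quad = n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} (Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t)\,\mathrm{vec}((\tilde\beta_n - \beta_o)') \\
&\quad = n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t) - E(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t)\right]\mathrm{vec}((\tilde\beta_n - \beta_o)') \\
&\qquad + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} E(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t)\,\mathrm{vec}((\tilde\beta_n - \beta_o)').
\end{aligned}$$
Applying Lemma 6.19 with $Z_{nt} = \mathrm{vec}(Z_tX_t' \otimes Z_t\varepsilon_t)$ we have that
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t) - E(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t)\right] \xrightarrow{p} 0.$$
This together with the fact that $\tilde\beta_n - \beta_o \xrightarrow{p} 0$ ensures that the first of the
two terms above vanishes. It therefore suffices to show that
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} E(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t)\,\mathrm{vec}((\tilde\beta_n - \beta_o)')$$
vanishes in probability.
Under the conditions of the theorem, we have $\sqrt{n}(\tilde\beta_n - \beta_o) = O_p(1)$ (by
asymptotic normality, Theorem 5.23) so it will suffice that
$$n^{-3/2}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} E(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t) = o(1).$$
Applying the triangle inequality and Jensen's inequality we have
$$\left|n^{-3/2}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} E(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t)\right|
\le n^{-3/2}\sum_{\tau=1}^{m_n} |w_{n\tau}|\sum_{t=\tau+1}^{n} E|Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t|,$$
where the absolute value and inequality are understood to hold element by
element. Because second moments of $Z_{thi}X_{thj}$ and $Z_{thi}\varepsilon_{th}$ are uniformly bounded,
the Minkowski and Cauchy-Schwarz inequalities ensure the existence of a
finite matrix $\Delta$ such that $E|Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t| \le \Delta$. Further $|w_{n\tau}| < \Delta$
for all $n$ and $\tau$, so that
$$\begin{aligned}
n^{-3/2}\sum_{\tau=1}^{m_n} |w_{n\tau}|\sum_{t=\tau+1}^{n} E|Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t|
&\le n^{-3/2}\sum_{\tau=1}^{m_n} |w_{n\tau}|\,(n-\tau)\,\Delta \\
&\le n^{-3/2}\,m_n\,\Delta\,n\,\Delta \\
&= n^{-1/2}\,m_n\,\Delta\Delta.
\end{aligned}$$
But $m_n = o(n^{1/4})$, so $n^{-1/2}m_n \to 0$, which implies
$$n^{-3/2}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} E(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t) = o(1)$$
as we needed.
The same argument works for all of the other terms under the conditions
given, so the result follows. ∎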


Comparing the conditions of this result to the asymptotic normality
result, Theorem 5.23, we see that the memory conditions here are twice as
strong as in Theorem 5.23. The moment conditions on $Z_{thi}\varepsilon_{th}$ are roughly
twice as strong as in Theorem 5.23, while those on $Z_{thi}X_{thj}$ are roughly four
times as strong.

The rate $m_n = o(n^{1/4})$ is not necessarily the optimal rate, and other
methods of proof can deliver faster rates for $m_n$. (See, e.g., Andrews, 1991.)
For discussion on methods relevant for choosing $m_n$, see Den Haan and
Levin (1997).

Gallant and White (1988, Lemma 6.5) provide the following conditions
on $w_{n\tau}$ guaranteeing the positive definiteness of $\hat V_n$.


Lemma 6.22 Let $\{Z_{nt}\}$ be an arbitrary double array and let $\{a_{ni}\}$, $n = 1, 2, \ldots$,
$i = 1,\ldots,m_n+1$ be a triangular array of real numbers. Then for
any triangular array of weights
$$w_{n\tau} = \sum_{i=\tau+1}^{m_n+1} a_{ni}a_{n,i-\tau}, \quad n = 1, 2, \ldots;\ \tau = 0,\ldots,m_n,$$
we have
$$Y_n = w_{n0}\sum_{t=1}^{n} Z_{nt}^2 + 2\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} Z_{nt}Z_{n,t-\tau} \ge 0.$$

Proof. See Gallant and White (1988, Lemma 6.5). ∎

For example, choosing $a_{ni} = (m_n + 1)^{-1/2}$ for all $i = 1,\ldots,m_n + 1$ delivers
$$w_{n\tau} = 1 - \tau/(m_n + 1), \quad \tau = 0,\ldots,m_n,$$
which are the Bartlett (1950) weights given by Newey and West (1987).
Other choices for $a_{ni}$ lead to other choices of weights that arise in the
problem of estimating the spectrum of a time series at zero frequency. (See
Anderson (1971, Ch. 8) for further discussion.) This is quite natural, as $V_n$
can be interpreted precisely as the value of the spectrum of $Z_t\varepsilon_t$ at zero
frequency.
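
The construction in Lemma 6.22 can be checked numerically in the scalar case. The sketch below builds $w_{n\tau}$ from a given array $a_{ni}$, recovers the Bartlett weights from the constant choice $a_{ni} = (m_n + 1)^{-1/2}$, and evaluates $Y_n$ on a short series. The function names and the data are illustrative assumptions.

```python
import math


def weights_from_a(a):
    """w_tau = sum_{i=tau+1}^{m+1} a_i * a_{i-tau}, tau = 0..m (a is 0-indexed here)."""
    m = len(a) - 1
    return [sum(a[i] * a[i - tau] for i in range(tau, m + 1))
            for tau in range(m + 1)]


def weighted_sum(u, w):
    """Y_n = w_0 * sum u_t^2 + 2 * sum_{tau>=1} w_tau * sum_{t>tau} u_t * u_{t-tau}."""
    n = len(u)
    total = w[0] * sum(ut * ut for ut in u)
    for tau in range(1, len(w)):
        total += 2.0 * w[tau] * sum(u[t] * u[t - tau] for t in range(tau, n))
    return total


m = 3
a = [1.0 / math.sqrt(m + 1)] * (m + 1)
w = weights_from_a(a)            # Bartlett weights 1 - tau / (m + 1)
u = [1.0, -1.0, 1.0, -1.0]       # hypothetical alternating series
y_n = weighted_sum(u, w)
print(w, y_n)
```

The same alternating series yields a negative value when every lag weight equals one, whereas the weights built from $a_{ni}$ keep $Y_n$ nonnegative.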


Corollary 6.23 Suppose the conditions of Theorem 6.21 hold. Then
$$D_n^{-1/2}\sqrt{n}(\tilde\beta_n - \beta_o) \overset{A}{\sim} N(0, I),$$
where
$$D_n = (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}.$$
Further, $\hat D_n - D_n \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat P_nZ'X/n^2)^{-1}(X'Z/n)\hat P_n\hat V_n\hat P_n(Z'X/n)(X'Z\hat P_nZ'X/n^2)^{-1}.$$
Proof. Immediate from Theorem 5.23 and Theorem 6.21. ∎


This result contains versions of all preceding asymptotic normality results 
as special cases while making minimal assumptions on the error covariance 
structure. 

Finally, we state a general result for the 2SIV estimator. 


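Corollary 6.24 Suppose
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(r-1)$ or
$\alpha$ of size $-2r/(r-2)$, $r > 2$;
(iii) (a) $E(Z_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1,\ldots,p$,
$i = 1,\ldots,l$, and $t = 1, 2, \ldots$;
(c) $V_n = \mathrm{var}(n^{-1/2}\sum_{t=1}^{n} Z_t\varepsilon_t)$ is uniformly positive definite;
(iv) (a) $E|Z_{thi}X_{thj}|^{2(r+\delta)} < \Delta < \infty$ and $E|Z_{thi}|^{r+\delta} < \Delta < \infty$ for some
$\delta > 0$ and all $h = 1,\ldots,p$, $i = 1,\ldots,l$, $j = 1,\ldots,k$, and $t = 1, 2, \ldots$;
(b) $Q_n = E(Z'X/n)$ has uniformly full column rank;
(c) $L_n = E(Z'Z/n)$ is uniformly positive definite.


Define
$$\hat V_n = w_{n0}\,n^{-1}\sum_{t=1}^{n} Z_t\tilde\varepsilon_t\tilde\varepsilon_t'Z_t'
+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^{n} \left[Z_t\tilde\varepsilon_t\tilde\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\tilde\varepsilon_{t-\tau}\tilde\varepsilon_t'Z_t'\right],$$
where $\tilde\varepsilon_t = Y_t - X_t'\tilde\beta_n$, $\tilde\beta_n = (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$, and
define
$$\beta_n^* = (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y.$$

If $m_n \to \infty$ as $n \to \infty$ such that $m_n = o(n^{1/4})$ and if $|w_{n\tau}| < \Delta$,
$n = 1, 2, \ldots$, $\tau = 1,\ldots,m_n$ such that $w_{n\tau} \to 1$ as $n \to \infty$ for each $\tau$, then
$D_n^{-1/2}\sqrt{n}(\beta_n^* - \beta_o) \overset{A}{\sim} N(0, I)$, where
$$D_n = (Q_n'V_n^{-1}Q_n)^{-1}.$$
Further, $\hat D_n - D_n \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat V_n^{-1}Z'X/n^2)^{-1}.$$

Proof. Conditions (i)-(iv) ensure that Theorem 6.21 holds for $\tilde\beta_n$. Set
$\hat P_n = \hat V_n^{-1}$ in Corollary 6.23. Then $P_n = V_n^{-1}$, and the result follows. ∎


CHAPTER 7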


Functional Central Limit Theory 
and Applications 


The conditions imposed on the random elements of $X_t$, $Y_t$, and $Z_t$
appearing in the previous chapters have required that certain moments be
bounded, e.g., $E|X_{thi}^2|^{1+\delta} < \Delta < \infty$ for some $\delta > 0$ and for all $t$
(Exercise 5.12 (iv.a)). In fact, many economic time-series processes, especially
those relevant for macroeconomics or finance, violate this restriction. In this
chapter, we develop tools to handle such processes.


7.1 Random Walks and Wiener Processes 


We begin by considering the random walk, defined as follows. 


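Definition 7.1 (Random walk) Let $\{\mathcal{X}_t\}$ be generated according to
$\mathcal{X}_t = \mathcal{X}_{t-1} + \mathcal{Z}_t$, $t = 1, 2, \ldots$, where $\mathcal{X}_0 = 0$ and $\{\mathcal{Z}_t\}$ is i.i.d. with $E(\mathcal{Z}_t) = 0$
and $0 < \sigma^2 = \mathrm{var}(\mathcal{Z}_t) < \infty$. Then $\{\mathcal{X}_t\}$ is a random walk.


By repeated substitution we have
$$\mathcal{X}_t = \mathcal{X}_{t-1} + \mathcal{Z}_t = \mathcal{X}_{t-2} + \mathcal{Z}_{t-1} + \mathcal{Z}_t
= \cdots = \mathcal{X}_0 + \sum_{s=1}^{t} \mathcal{Z}_s = \sum_{s=1}^{t} \mathcal{Z}_s,$$
as we have assumed $\mathcal{X}_0 = 0$. It is straightforward to establish the following
fact.


Exercise 7.2 If $\{\mathcal{X}_t\}$ is a random walk, then $E(\mathcal{X}_t) = 0$ and
$\mathrm{var}(\mathcal{X}_t) = t\sigma^2$, $t = 1, 2, \ldots$


Because $E(\mathcal{X}_t^2)^{1/2} \le E(|\mathcal{X}_t|^r)^{1/r}$ for all $r \ge 2$ (Jensen's inequality), a
random walk $\mathcal{X}_t$ cannot satisfy $E|\mathcal{X}_t^2|^{1+\delta} < \Delta < \infty$ for all $t$ (set
$r = 2(1 + \delta)$). Thus when $X_t$, $Y_t$, or $Z_t$ contain random walks, the results of
the previous chapters do not apply.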

One way to handle such situations is to transform the process so that it
does satisfy conditions of the sort previously imposed. For example, we can
take the "first difference" of a random walk to get $\mathcal{Z}_t = \mathcal{X}_t - \mathcal{X}_{t-1}$ (which
is i.i.d.) and base subsequent analysis on $\mathcal{Z}_t$.

Nevertheless, it is frequently of interest to examine the behavior of
estimators that do not make use of such transformations, but that instead
directly involve random walks or similar processes. For example, consider
the least squares estimator $\hat\beta_n = (X'X)^{-1}X'Y$, where $X$ is $n \times 1$ with
$X_t = Y_{t-1}$, and $Y_t$ is a random walk. To study the behavior of such
estimators, we make use of Functional Central Limit Theorems (FCLTs), which
extend the Central Limit Theorems (CLTs) studied previously in just the
right way.

Before we can study the behavior of estimators based on random walks
or similar processes, we must understand in more detail the behavior of the
processes themselves. This also directly enables us to understand the way
in which the FCLT extends the CLT.

Thus, consider the random walk $\{\mathcal{X}_t\}$. We can write
$$\mathcal{X}_n = \sum_{t=1}^{n} \mathcal{Z}_t.$$
Rescaling, we have
$$n^{-1/2}\mathcal{X}_n/\sigma = n^{-1/2}\sum_{t=1}^{n} \mathcal{Z}_t/\sigma.$$
According to the Lindeberg-Lévy central limit theorem, we have
$$n^{-1/2}\mathcal{X}_n/\sigma \xrightarrow{d} N(0, 1).$$
Thus, when $n$ is large, outcomes of the rescaled random walk are drawn
from a distribution that is approximately normal.

Next, consider the behavior of the partial sum
$$\mathcal{X}_{[an]} = \sum_{t=1}^{[an]} \mathcal{Z}_t,$$
where $1/n \le a < \infty$ and $[an]$ represents the largest integer less than or
equal to $an$. For $0 \le a < 1/n$, define $\mathcal{X}_{[an]} = \mathcal{X}_0 = 0$, so the partial sum is
now defined for all $0 \le a < \infty$. Applying the same rescaling, we define
$$W_n(a) = n^{-1/2}\mathcal{X}_{[an]}/\sigma = n^{-1/2}\sum_{t=1}^{[an]} \mathcal{Z}_t/\sigma.$$
Now
$$W_n(a) = n^{-1/2}[an]^{1/2}\left\{[an]^{-1/2}\sum_{t=1}^{[an]} \mathcal{Z}_t/\sigma\right\},$$
and for given $a$, the term in the brackets $\{\cdot\}$ again obeys the CLT and
converges in distribution to $N(0, 1)$, whereas $n^{-1/2}[an]^{1/2}$ converges to
$a^{1/2}$. It follows from now standard arguments that $W_n(a)$ converges in
distribution to $N(0, a)$.
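
A small Monte Carlo sketch can illustrate this limit. It assumes innovations equal to $\pm 1$ with equal probability (so $\sigma = 1$ and the innovations are not normal) and checks that the sample mean and variance of $W_n(a)$ across replications are close to $0$ and $a$. The function names, the sample sizes, and the seed are assumptions.

```python
import math
import random


def w_n(z, a, sigma=1.0):
    """Rescaled partial sum W_n(a) = n^{-1/2} * sum_{t <= [a n]} z_t / sigma."""
    n = len(z)
    k = math.floor(a * n)      # [an], the integer part of a * n
    return sum(z[:k]) / (sigma * math.sqrt(n))


def simulate(a, n=400, reps=2000, seed=0):
    """Sample mean and variance of W_n(a) over independent random walks."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        draws.append(w_n(z, a))
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / reps
    return mean, var


mean, var = simulate(0.5)
print(mean, var)   # expected to be near 0 and near a = 0.5
```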

We have written $W_n(a)$ so that it is clear that $W_n$ can be considered
to be a function of $a$. Also, because $W_n(a)$ depends on the $\mathcal{Z}_t$'s, it is
random. Therefore, we can think of $W_n(a)$ as defining a random function of
$a$, which we write $W_n$. Just as the CLT provides conditions ensuring that
the rescaled random walk $n^{-1/2}\mathcal{X}_n/\sigma$ (which we can now write as $W_n(1)$)
converges, as $n$ becomes large, to a well-defined limiting random variable
(the standard normal), the FCLT provides conditions ensuring that the
random function $W_n$ converges, as $n$ becomes large, to a well-defined
limiting random function, say $W$. The word "Functional" in Functional Central
Limit Theorem appears because this limit is a function of $a$.

The limiting random function specified by the FCLT is, as we should 
expect, a generalization of the standard normal random variable. This limit 
is called a Wiener process or a Brownian motion in honor of Norbert Wiener 
(1923, 1924), who provided the mathematical foundation for the theory of 
random motions observed and described by nineteenth century botanist 
Robert Brown in 1827. Of further historical interest is the fact that in his 
dissertation, Bachelier (1900) proposed the Brownian motion as a model 
for stock prices.

Before we formally characterize the Wiener process, we note some further
properties of random walks, suitably rescaled.


Exercise 7.3 If $\{\mathcal{X}_t\}$ is a random walk, then $\mathcal{X}_{t_4} - \mathcal{X}_{t_3}$ is independent
of $\mathcal{X}_{t_2} - \mathcal{X}_{t_1}$ for all $t_1 < t_2 < t_3 < t_4$. Consequently, $W_n(a_4) - W_n(a_3)$ is
independent of $W_n(a_2) - W_n(a_1)$ for all $a_i$ such that $[a_in] = t_i$, $i = 1,\ldots,4$.


Exercise 7.4 For given $0 \le a < b < \infty$, $W_n(b) - W_n(a) \xrightarrow{d} N(0, b - a)$ as
$n \to \infty$.


In words, the random walk has independent increments (Exercise 7.3) 
and those increments have a limiting normal distribution, with a variance 
reflecting the size of the interval (b — a) over which the increment is taken 
(Exercise 7.4). 

It should not be surprising, therefore, that the limit of the sequence of
functions $\{W_n\}$ constructed from the random walk preserves these
properties in the limit in an appropriate sense. In fact, these properties form the
basis of the definition of the Wiener process.


Definition 7.5 (Wiener process) Let $(\Omega, \mathcal{F}, P)$ be a complete
probability space. Then $W : [0, \infty) \times \Omega \to \mathbb{R}$ is a Wiener process if for each
$a \in [0, \infty)$, $W(a, \cdot)$ is measurable-$\mathcal{F}$, and in addition


(i) The process starts at zero: $P[W(0, \cdot) = 0] = 1$.


(ii) The increments are independent: If $0 \le a_0 \le a_1 \le \cdots \le a_k < \infty$,
then $W(a_i, \cdot) - W(a_{i-1}, \cdot)$ is independent of $W(a_j, \cdot) - W(a_{j-1}, \cdot)$,
$j = 1,\ldots,k$, $j \ne i$, for all $i = 1,\ldots,k$.


(iii) The increments are normally distributed: For $0 \le a \le b < \infty$ the
increment $W(b, \cdot) - W(a, \cdot)$ is distributed as $N(0, b - a)$.


In the definition, we have written $W(a, \cdot)$ for explicitness; whenever
convenient, however, we will write $W(a)$ instead of $W(a, \cdot)$, analogous to our
notation elsewhere.

Fundamental facts about the Wiener process are simple to state, but 
somewhat involved to prove. The interested reader may consult Billingsley 
(1979, Section 37) or Davidson (1994, Chapter 27) for further background 
and details. We record the following facts. 


Proposition 7.6 The Wiener process $W$ exists; that is, there exists a
function $W$ satisfying the conditions of Definition 7.5.


Proof. See Billingsley (1979, pp. 443-444). ∎

Proposition 7.7 There exists a Wiener process $W$ such that for all $\omega \in \Omega$,
$W(0, \omega) = 0$ and $W(\cdot, \omega) : [0, \infty) \to \mathbb{R}$ is continuous on $[0, \infty)$.


Proof. See Billingsley (1979, pp. 444-447). ∎


When W has the properties established in Proposition 7.7 we say that 
$W$ has continuous sample paths. ($W(\cdot,\omega)$ is a sample path.) From now
on, when we speak of a Wiener process, we shall have in mind one with 
continuous sample paths. 

Even though W has continuous sample paths, these paths are highly 
irregular: they wiggle extravagantly, as the next result makes precise. 


Proposition 7.8 For $\omega \in F$, $P(F) = 1$, $W(\cdot,\omega)$ is nowhere differentiable.


Proof. See Billingsley (1979, pp. 450-451). ■


With these basic facts in place, we can now consider the sense in which
$W_n$ converges to $W$. Because $W_n$ is a random function, our available
notions of stochastic convergence for random variables are inadequate.
Nevertheless, we can extend these notions naturally so that they
adequately treat the convergence of $W_n$. Our main tool will
be an extension of the notion of convergence in distribution known as weak
convergence.


7.2 Weak Convergence 


To obtain the right notion for weak convergence of $W_n$, we need to consider
what sort of function $W_n$ is. Recall that by definition

$$W_n(a) \equiv n^{-1/2} \sum_{s=1}^{[an]} Z_s/\sigma.$$


Fix a realization $\omega \in \Omega$, and suppose that we first choose $a$ so that for some
integer $t$, $an = t$ (i.e., $a = t/n$). Then we will consider what happens as $a$
increases to the value $(t+1)/n$. With $a = t/n$ we have $[an] = t$, so

$$W_n(a,\omega) = n^{-1/2} \sum_{s=1}^{t} Z_s(\omega)/\sigma, \qquad a = t/n.$$

For $t/n < a < (t+1)/n$, we still have $[an] = t$, so

$$W_n(a,\omega) = n^{-1/2} \sum_{s=1}^{t} Z_s(\omega)/\sigma, \qquad t/n < a < (t+1)/n.$$

That is, $W_n(a,\omega)$ is constant for $t/n \le a < (t+1)/n$. When $a$ hits $(t+1)/n$,
we see that $W_n(a,\omega)$ jumps to

$$W_n(a,\omega) = n^{-1/2} \sum_{s=1}^{t+1} Z_s(\omega)/\sigma, \qquad a = (t+1)/n.$$

Thus, for given $\omega$, $W_n(\cdot,\omega)$ is a piecewise constant function that jumps
to a new value whenever $a = t/n$ for integer $t$. This is a simple example
of a function that is said to be right continuous with left limit (rcll), also
referred to as càdlàg (continue à droite, limites à gauche). It follows that
we need a notion of convergence that applies to functions $W_n(\cdot,\omega)$ that are
rcll on $[0,\infty)$ for each $\omega$ in $\Omega$.
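The step-function shape of a single path can be seen in a tiny example. The sketch below fixes one realization with $n = 4$ and $\sigma = 1$ (both assumptions made only for illustration; the function name `w_n` is hypothetical) and evaluates the path at a few points, including a point just to the left of a jump.

```python
import math


def w_n(z, a, sigma=1.0):
    """Evaluate W_n(a) = n**-0.5 * (z_1 + ... + z_[an]) / sigma for one realization z."""
    n = len(z)
    # [an]; the tiny tolerance guards against floating-point round-off at a = t/n.
    m = math.floor(a * n + 1e-12)
    return sum(z[:m]) / (sigma * math.sqrt(n))


z = [1.0, -1.0, 1.0, 1.0]  # one fixed realization of the steps, n = 4

# Constant on [1/4, 2/4): the same value at a = 0.25, 0.30, and 0.49.
on_interval = [w_n(z, a) for a in (0.25, 0.30, 0.49)]

# At the jump point a = 2/4 the function takes the new value (right continuity),
# while the value just to the left is the old one (the left limit).
at_jump = w_n(z, 0.5)
just_left = w_n(z, 0.5 - 1e-9)

print(on_interval, at_jump, just_left)
```

The value at the jump point agrees with the values immediately to its right, and the left limit exists but differs from it, which is exactly the rcll property.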

Formally, we define the rcll functions on $[0,\infty)$ as follows.


Definition 7.9 (rcll) $D[0,\infty)$ is the space of functions $f : [0,\infty) \to \mathbb{R}$
that (i) are right continuous and (ii) have left limits (rcll):


(i) For $0 \le a < \infty$, $f(a+) \equiv \lim_{b \downarrow a} f(b)$ exists and $f(a+) = f(a)$.
(ii) For $0 < a < \infty$, $f(a-) \equiv \lim_{b \uparrow a} f(b)$ exists.


In this definition, the notation $\lim_{b \downarrow a}$ means the limit as $b$ approaches $a$
from the right ($a < b$) while $\lim_{b \uparrow a}$ means the limit as $b$ approaches $a$ from
the left ($b < a$).

The space $C[0,\infty)$ of continuous functions on $[0,\infty)$ is a subspace of
the space $D[0,\infty)$. By Proposition 7.7, $W(\cdot,\omega)$ belongs to $C[0,\infty)$ (and
therefore $D[0,\infty)$) for all $\omega \in \Omega$, whereas $W_n(\cdot,\omega)$ belongs to $D[0,\infty)$ for
all $\omega \in \Omega$.

Thus, we need an appropriate notion for weak convergence of a sequence
of random elements of $D[0,\infty)$ analogous to our notion of convergence in
distribution to a random element of $\mathbb{R}$.

To this end, recall that if $\{\mathcal{X}_n\}$ is a sequence of real-valued random
variables then $\mathcal{X}_n \xrightarrow{d} \mathcal{X}$ if $F_n(x) \to F(x)$ for every continuity point $x$ of
$F$, where $F_n$ is the c.d.f. of $\mathcal{X}_n$ and $F$ is the c.d.f. of $\mathcal{X}$ (Definition 4.1). By
definition,

$$F_n(x) = P\{\omega : \mathcal{X}_n(\omega) \le x\} = P\{\omega : \mathcal{X}_n(\omega) \in B_x\}$$

and

$$F(x) = P\{\omega : \mathcal{X}(\omega) \le x\} = P\{\omega : \mathcal{X}(\omega) \in B_x\},$$

where $B_x \equiv (-\infty, x]$ (a Borel set). Let $\mu_n(B_x) \equiv P\{\omega : \mathcal{X}_n(\omega) \in B_x\}$ and
$\mu(B_x) \equiv P\{\omega : \mathcal{X}(\omega) \in B_x\}$. Then convergence in distribution holds if
$\mu_n(B_x) \to \mu(B_x)$ for all sets $B_x$ such that $x$ is a continuity point of $F$.

Observe that $x$ is a continuity point of $F$ if and only if $\mu(\{x\}) = 0$,
that is, the probability that $\mathcal{X} = x$ is zero. Also note that $x$ is distinctive
because it lies on the boundary of the set $B_x$. Formally, the boundary of a
set $B$, denoted $\partial B$, is the set of all points in the closure of $B$ that are
not interior to $B$. For $B_x$, we have $\partial B_x = \{x\}$. Thus, when $x$ is a
continuity point of $F$, we have $\mu(\partial B_x) = 0$,
and we call $B_x$ a continuity set of $\mu$; in general, any Borel set $B$ such that
$\mu(\partial B) = 0$ is called a continuity set of $\mu$, or a $\mu$-continuity set.

Thus, convergence in distribution occurs whenever

$$\mu_n(B) \to \mu(B)$$

for all continuity sets $B$. Clearly, this implies the original Definition 4.1,
because that definition can be restated as

$$\mu_n(B_x) \to \mu(B_x)$$

for all continuity sets of the form $B_x = (-\infty, x]$. It turns out, however,
that these two requirements are equivalent. Either can serve as a definition
of convergence in distribution.
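A small worked example, added here for illustration, shows why convergence can be required only on continuity sets. Take the degenerate sequence $\mathcal{X}_n = 1/n$ with limit $\mathcal{X} = 0$ (an assumption of this example, not a case treated in the text). Then

```latex
\[
F_n(x) = \mathbf{1}\{x \ge 1/n\}, \qquad F(x) = \mathbf{1}\{x \ge 0\},
\]
\[
F_n(x) \to F(x) \ \text{for every } x \ne 0,
\qquad\text{but}\qquad
F_n(0) = 0 \not\to 1 = F(0).
\]
\[
\mu(\partial B_0) = \mu(\{0\}) = 1 \ne 0 .
\]
```

The only point at which convergence fails is $x = 0$, and $B_0 = (-\infty, 0]$ is precisely the set that is not a continuity set of $\mu$. Since $\mathcal{X}_n$ plainly approaches $\mathcal{X}$, a sensible definition has to ignore such sets.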

In fact, the formulation in terms of generic continuity sets $B$ is ideally
suited for direct extension from real-valued random variables $\mathcal{X}_n$ and Borel
sets $B$ (subsets of $\mathbb{R}$) not only to $D[0,\infty)$-valued random functions $W_n$ and
suitable Borel subsets of $D[0,\infty)$, but also to random elements of metric
spaces generally and their Borel subsets. This latter fact is not just of
abstract interest: it plays a key role in the analysis of spurious regression
and cointegration, as we shall see next.

The required Borel subsets of $D[0,\infty)$ can be generated from the open
sets of $D[0,\infty)$ in a manner precisely analogous to that in which we can
generate the Borel subsets of $\mathbb{R}$ from the open sets of $\mathbb{R}$ (recall Definition
3.18 and the following comments). However, because of the richness of the
set $D[0,\infty)$ we have considerable latitude in defining what we mean by an
open set, and we must therefore exercise care in doing so. For our
purposes, it is useful to define open sets using a metric
$d$ on $D[0,\infty)$.

Let $S$ be a set (e.g., $S = D[0,\infty)$ or $S = \mathbb{R}$). A metric is a mapping
$d : S \times S \to \mathbb{R}$ with the properties (i) (nonnegativity) $d(x,y) \ge 0$ for all
$x, y \in S$, and $d(x,y) = 0$ if and only if $x = y$; (ii) (symmetry) $d(x,y) =
d(y,x)$ for all $x, y \in S$; (iii) (triangle inequality) $d(x,y) \le d(x,z) + d(z,y)$
for all $x, y, z \in S$. We call the pair $(S,d)$ a metric space. For example,
$d_{|\cdot|}(x,y) \equiv |x - y|$ defines a metric on $\mathbb{R}$, and $(\mathbb{R}, d_{|\cdot|})$ is a metric space.

By definition, a subset $A$ of $S$ is open in the metric $d$ ($d$-open) if every
point $x$ of $A$ is an interior point; that is, for some $\varepsilon > 0$, the $\varepsilon$-neighborhood
of $x$, $\{y \in S : d(x,y) < \varepsilon\}$, is a subset of $A$. We can now define the Borel
sets of $S$.


Definition 7.10 Let $d$ be a metric on $S$. The Borel $\sigma$-field $\mathcal{S}_d = \mathcal{B}(S,d)$
is the smallest collection of sets (the Borel sets of $S$ with respect to $d$) that
includes


(i) all $d$-open subsets of $S$;
(ii) the complement $B^c$ of any set $B$ in $\mathcal{S}_d$;
(iii) the union $\bigcup_{i=1}^{\infty} B_i$ of any sequence $\{B_i\}$ in $\mathcal{S}_d$.


Observe that different choices for the metric $d$ may generate different
Borel $\sigma$-fields. When the metric $d$ is understood implicitly, we suppress
explicit reference and just write $\mathcal{S} = \mathcal{S}_d$.

Putting $S = \mathbb{R}$ or $S = \mathbb{R}^k$ and letting $d$ be the Euclidean metric ($d(x,y) =
\|x - y\| = ((x - y)'(x - y))^{1/2}$) gives the Borel $\sigma$-fields $\mathcal{B}(\mathbb{R})$ or $\mathcal{B}(\mathbb{R}^k)$
introduced in Chapter 3. Putting $S = D[0,\infty)$ and choosing $d$ suitably will
give us the needed Borel sets of $D[0,\infty)$, denoted $\mathcal{D}_{\infty,d} \equiv \mathcal{B}(D[0,\infty), d)$.

The pair $(S, \mathcal{S}_d)$ is a metrized measurable space. We obtain a metrized
probability space $(S, \mathcal{S}_d, \mu)$ by requiring that $\mu$ be a probability measure on
$(S, \mathcal{S}_d)$.


We can now give the desired definition of weak convergence. 


Definition 7.11 (Weak convergence) Let $\mu$, $\mu_n$, $n = 1, 2, \ldots$, be
probability measures on the metrized measurable space $(S, \mathcal{S})$. Then $\mu_n$
converges weakly to $\mu$, written $\mu_n \Rightarrow \mu$ or $\mu_n \xrightarrow{w} \mu$, if $\mu_n(A) \to \mu(A)$ as
$n \to \infty$ for all $\mu$-continuity sets $A$ of $\mathcal{S}$.


The parallel with the case of real-valued random variables $\mathcal{X}_n$ is now clear
from this definition and our discussion following Definition 7.9. Indeed,
Definition 4.1 is equivalent to Definition 7.11 when $(S, \mathcal{S}) = (\mathbb{R}, \mathcal{B})$.

An equivalent definition can be posed in terms of suitably defined
integrals of real-valued functions. Specifically, $\mu_n \Rightarrow \mu$ if and only if

$$\int f \, d\mu_n \to \int f \, d\mu \quad \text{as } n \to \infty$$

for all bounded uniformly continuous functions $f : S \to \mathbb{R}$. (See Billingsley,
1968, pp. 11-14.) We use Definition 7.11 because of its straightforward
relation to Definition 4.1.

Convergence of random elements on $(S, \mathcal{S})$ is covered by our next
definition.


Definition 7.12 Let $V_n$ and $V$ be random elements on $(S, \mathcal{S})$; that is, $V_n :
\Omega \to S$ and $V : \Omega \to S$ are measurable functions on a probability space
$(\Omega, \mathcal{F}, P)$. Then we say that $V_n$ converges weakly to $V$, denoted $V_n \Rightarrow V$,
provided that $\mu_n \Rightarrow \mu$, where $\mu_n(A) \equiv P\{\omega : V_n(\omega) \in A\}$ and $\mu(A) \equiv
P\{\omega : V(\omega) \in A\}$ for $A \in \mathcal{S}$.


7.3 Functional Central Limit Theorems 


In previous sections we have defined the functions $W_n$ and the Wiener
process $W$ in the natural and usual way as functions mapping $[0,\infty) \times \Omega$ to
the real line, $\mathbb{R}$. Typically, however, the Functional Central Limit Theorem
(FCLT) and our applications of the FCLT, to which we now turn our
attention, are concerned with the convergence of a restricted version of
$W_n$, defined by

$$W_n(a,\omega) = n^{-1/2} \sum_{s=1}^{[an]} Z_s(\omega)/\sigma, \qquad 0 \le a \le 1.$$

For convenience, we continue to use the same notation, but now we have
$W_n : [0,1] \times \Omega \to \mathbb{R}$ and we view $W_n(a)$ as defining a random element of
$D[0,1]$, the functions on $[0,1]$ that are right continuous with left limits.

The FCLT provides conditions under which $W_n$ converges to the Wiener
process restricted to $[0,1]$, which we continue to write as $W$. We now view
$W(a)$ as defining a random element of $C[0,1]$, the continuous functions on
the unit interval.

In treating the FCLT, we will always be dealing with $D[0,1]$ and $C[0,1]$.
Accordingly, in this context we write $D \equiv D[0,1]$ and $C \equiv C[0,1]$ to keep
the notation simple. No confusion will arise from this shorthand.

The measures whose convergence is the subject of the Functional
Central Limit Theorem (FCLT) can be defined as

$$\mu_n(A) \equiv P\{\omega : W_n(\cdot,\omega) \in A\},$$

where $A \in \mathcal{D} = \mathcal{D}_d$ for a suitable choice of metric $d$ on $D$.

A study of the precise properties of the metrics typically used in this
context cannot adequately be pursued here. Suffice it to say that excellent
treatments can be found in Billingsley (1968, Chapter 3) and Davidson
(1994, Chapter 28), where Billingsley's modification $d_B$ of the Skorokhod
metric $d_S$ is shown to be ideally suited for studying the FCLT. The
Skorokhod metric $d_S$ is itself a modification of the uniform metric $d_u$ such that
$d_u(U,V) = \sup_{a \in [0,1]} |U(a) - V(a)|$, $U, V \in D$.
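One commonly cited reason the uniform metric needs modifying on $D$ is that it treats two step functions as far apart whenever their jump times differ, however slightly. The sketch below illustrates this with two indicator-type rcll functions; it approximates the supremum on a finite grid, and the helper names `step_function` and `uniform_distance` are hypothetical.

```python
def step_function(jump_at):
    """Return the rcll function equal to 0 before `jump_at` and 1 from `jump_at` on."""
    return lambda a: 1.0 if a >= jump_at else 0.0


def uniform_distance(u, v, grid_size=10_000):
    """Approximate d_u(U, V) = sup over [0, 1] of |U(a) - V(a)| on an even grid."""
    return max(abs(u(i / grid_size) - v(i / grid_size))
               for i in range(grid_size + 1))


same_jump = uniform_distance(step_function(0.5), step_function(0.5))
near_jump = uniform_distance(step_function(0.5), step_function(0.501))

# Moving the jump time by 0.001 leaves the two functions a full jump apart.
print(same_jump, near_jump)
```

A Skorokhod-type metric also allows a small deformation of the time axis before comparing values, so functions like these two are regarded as close.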

The fact that $W_n(a,\cdot)$ is measurable for each $a$ is enough to ensure that
the set $\{\omega : W_n(\cdot,\omega) \in A\}$ is a measurable set, so that the probability
defining $\mu_n$ is well defined. This is a consequence of Theorem 14.5 of Billingsley
(1968, p. 121).

The Functional Central Limit Theorem provides conditions under which
$\mu_n$ converges weakly to the Wiener measure $\mu_W$, defined as

$$\mu_W(A) \equiv P\{\omega : W(\cdot,\omega) \in A \cap C\},$$

where $A \in \mathcal{D}$ and $C = C[0,1]$. When $\mu_n \Rightarrow \mu_W$ for $\mu_n$ and $\mu_W$ as just
defined, we say that $W_n$ obeys the FCLT.

The simplest FCLT is a generalization of the Lindeberg-Lévy CLT, known
as Donsker's theorem (Donsker, 1951).


Theorem 7.13 (Donsker) Let $\{Z_t\}$ be a sequence of i.i.d. random scalars
with mean zero. If $\sigma^2 \equiv \operatorname{var}(Z_t) < \infty$, $\sigma^2 \ne 0$, then $W_n \Rightarrow W$.


Proof. See the proof given by Billingsley (1968, Theorem 16.1, pp. 137-138). ■


Because pointwise convergence in distribution $W_n(a,\cdot) \xrightarrow{d} W(a,\cdot)$ for
each $a \in [0,1]$ is necessary (but not sufficient) for weak convergence ($W_n \Rightarrow
W$), the Lindeberg-Lévy Central Limit Theorem ($W_n(1,\cdot) \xrightarrow{d} W(1,\cdot)$)
follows immediately from Donsker's theorem. Donsker's theorem is strictly
stronger than Lindeberg-Lévy, however, as both use identical assumptions,
but Donsker's theorem delivers a much stronger conclusion.

Donsker called his result an invariance principle. Consequently, the FCLT 
is often referred to as an invariance principle. 
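The word "invariance" can be illustrated numerically: the limiting behavior of a functional of the whole path should not depend on the step distribution, only on its mean and variance. The sketch below compares two very different mean-zero, unit-variance step distributions through the running maximum of $W_n$ over $[0,1]$. The choice of functional, the threshold 1, the sample sizes, and the function names are all assumptions made for illustration.

```python
import random


def max_of_path(steps, sigma):
    """Return the maximum over a in [0, 1] of W_n(a) for one realization of the steps."""
    n = len(steps)
    level = 0.0
    highest = 0.0  # W_n(0) = 0, so the maximum is at least zero
    for z in steps:
        level += z
        highest = max(highest, level)
    return highest / (sigma * n ** 0.5)


def prob_max_below(draw_step, sigma, x=1.0, n=400, reps=1000, seed=0):
    """Estimate P[max_a W_n(a) <= x] by Monte Carlo."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        steps = [draw_step(rng) for _ in range(n)]
        if max_of_path(steps, sigma) <= x:
            hits += 1
    return hits / reps


# Rademacher steps: +1 or -1 with equal probability (mean 0, variance 1).
p_rademacher = prob_max_below(lambda rng: rng.choice((-1.0, 1.0)), sigma=1.0)
# Centered exponential steps: Exp(1) - 1 (mean 0, variance 1), strongly skewed.
p_exponential = prob_max_below(lambda rng: rng.expovariate(1.0) - 1.0,
                               sigma=1.0, seed=1)

print(p_rademacher, p_exponential)
```

The two estimated probabilities should be close to each other, up to simulation noise and a finite-$n$ discretization effect, even though one step distribution is symmetric and two-valued and the other is continuous and skewed.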

So far, we have assumed that the sequence $\{Z_t\}$ used to construct $W_n$
is i.i.d. Nevertheless, just as we can obtain central limit theorems when
$\{Z_t\}$ is not necessarily i.i.d., so also can we obtain functional central limit
theorems when $\{Z_t\}$ is not necessarily i.i.d. In fact, versions of the FCLT
hold for each CLT previously given.

To make the statements of our FCLTs less cumbersome than they would 
otherwise be, we use the following condition. 


Definition 7.14 (Global covariance stationarity) Let $\{Z_t\}$ be a
sequence of $k \times 1$ random vectors such that $E(Z_t' Z_t) < \infty$, $t = 1, 2, \ldots$, and
define $\Sigma_n \equiv \operatorname{var}(n^{-1/2} \sum_{t=1}^{n} Z_t)$. If $\Sigma \equiv \lim_{n \to \infty} \Sigma_n$ exists and is finite,
then we say that $\{Z_t\}$ is globally covariance stationary. We call $\Sigma$ the
global covariance matrix.


For our CLTs, we have required only that $\Sigma_n = O(1)$. FCLTs can be
given under this weaker requirement, but both the conditions and
conclusions are more complicated to state. (See for example Wooldridge and
White, 1988, and Davidson, 1994, Ch. 29.) Imposing the global covariance
stationarity requirement gains us clarity at the cost of a mild restriction
on the allowed heterogeneity. Note that global covariance stationarity does
not require us to assume that $E(Z_t Z_{t-\tau}') = E(Z_s Z_{s-\tau}')$ for all $t$, $s$, and
$\tau$ (covariance stationarity).

We give the following FCLTs. 


Theorem 7.15 (Lindeberg-Feller i.n.i.d. FCLT) Suppose $\{Z_t\}$
satisfies the conditions of the Lindeberg-Feller central limit theorem (Theorem
5.6), and suppose that $\{Z_t\}$ is globally covariance stationary. If $\sigma^2 \equiv
\lim_{n \to \infty} \operatorname{var}(W_n(1)) > 0$, then $W_n \Rightarrow W$.


Proof. This is a straightforward consequence of Theorem 15.4 of Billingsley
(1968). ■


Theorem 7.16 (Liapounov i.n.i.d. FCLT) Suppose $\{Z_t\}$ satisfies the
conditions of the Liapounov central limit theorem (Theorem 5.10) and
suppose that $\{Z_t\}$ is globally covariance stationary. If $\sigma^2 \equiv \lim_{n \to \infty} \operatorname{var}(W_n(1))
> 0$, then $W_n \Rightarrow W$.


Proof. This follows immediately from Theorem 7.15, as the Liapounov
moment condition implies the Lindeberg-Feller condition. ■


Theorem 7.17 (Stationary ergodic FCLT) Suppose $\{Z_t\}$ satisfies the
conditions of the stationary ergodic CLT (Theorem 5.16). If $\sigma^2 \equiv \lim_{n \to \infty}
\operatorname{var}(W_n(1)) > 0$, then $W_n \Rightarrow W$.


Proof. This follows from Theorem 3 of Scott (1973). ■


Note that the conditions of the stationary ergodic CLT already impose 
global covariance stationarity, so we do not require an explicit statement 
of this condition. 


Theorem 7.18 (Heterogeneous mixing FCLT) Suppose $\{Z_t\}$ satisfies
the conditions of the heterogeneous mixing CLT (Theorem 5.20) and
suppose that $\{Z_t\}$ is globally covariance stationary. If $\sigma^2 \equiv \lim_{n \to \infty} \operatorname{var}(W_n(1))
> 0$, then $W_n \Rightarrow W$.


Proof. See Wooldridge and White (1988, Theorem 2.11). ■


Theorem 7.19 (Martingale difference FCLT) Suppose $\{Z_t\}$ satisfies
the conditions of the martingale difference central limit theorem (Theorem
5.24 or Corollary 5.26) and suppose that $\{Z_t\}$ is globally covariance
stationary. If $\sigma^2 \equiv \lim_{n \to \infty} \operatorname{var}(W_n(1)) > 0$, then $W_n \Rightarrow W$.


Proof. See McLeish (1974). See also Hall (1977). ■


Whenever $\{Z_t\}$ satisfies conditions sufficient to ensure $W_n \Rightarrow W$, we say
$\{Z_t\}$ obeys the FCLT. The dependence and moment conditions under which
$\{Z_t\}$ obeys the FCLT can be relaxed further. For example, Wooldridge and
White (1988) give results that do not require global covariance stationarity
and that apply to infinite histories of mixing sequences with possibly
trending moments. Davidson (1994) contains an excellent exposition of these and
related results.


7.4 Regression with a Unit Root 


The FCLT and the following extension of Lemma 4.27 give us the tools 
needed to study regression for “unit root” processes. Our treatment here 
follows that of Phillips (1987). We begin with an extension of Lemma 4.27. 


Theorem 7.20 (Continuous mapping theorem) Let the pair $(S, \mathcal{S})$ be
a metrized measurable space and let $\mu$, $\mu_n$ be probability measures on $(S, \mathcal{S})$
corresponding to $V$, $V_n$, random elements of $S$, $n = 1, 2, \ldots$. (i) Let $h :
S \to \mathbb{R}$ be a continuous mapping. If $V_n \Rightarrow V$, then $h(V_n) \Rightarrow h(V)$. (ii)
Let $h : S \to \mathbb{R}$ be a mapping such that the set of discontinuities of $h$,
$D_h \equiv \{s \in S : \lim_{r \to s} h(r) \ne h(s)\}$, has $\mu(D_h) = 0$. If $V_n \Rightarrow V$, then
$h(V_n) \Rightarrow h(V)$.


Proof. See Billingsley (1968, pp. 29-31). ■


Using this result, we can now prove an asymptotic distribution result for 
the least squares estimator with unit root regressors, analogous to Theorem 
4.25. 


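Theorem 7.21 (Unit root regression) Suppose
(i) $Y_t = X_t \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, where $X_t = Y_{t-1}$, $\beta_o = 1$, and $Y_0 = 0$;

(ii) $W_n \Rightarrow W$, where $W_n$ is defined by $W_n(a) \equiv n^{-1/2} \sum_{t=1}^{[an]} \varepsilon_t/\sigma$, $0 \le
a \le 1$, where $\sigma^2 \equiv \lim_{n \to \infty} \operatorname{var}(n^{-1/2} \sum_{t=1}^{n} \varepsilon_t)$ is finite and nonzero.

Then
(a) $n^{-2} \sum_{t=1}^{n} Y_{t-1}^2 \Rightarrow \sigma^2 \int_0^1 W(a)^2 \, da$.
If in addition
(iii) $n^{-1} \sum_{t=1}^{n} \varepsilon_t^2 \xrightarrow{p} \tau^2$, $0 < \tau^2 < \infty$,
then
(b) $n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t \Rightarrow (\sigma^2/2)\,(W(1)^2 - \tau^2/\sigma^2)$;

(c) $n(\hat{\beta}_n - 1) \Rightarrow \left[\int_0^1 W(a)^2 \, da\right]^{-1} (1/2)\,(W(1)^2 - \tau^2/\sigma^2)$;
and
(d) $\hat{\beta}_n \xrightarrow{p} 1$.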


Before proceeding to the proof, it is helpful to make some comments 
concerning both the assumptions and the conclusions. 

First, when $\{Y_t\}$ is generated according to assumptions (i) and (ii) of
Theorem 7.21, we will say that $\{Y_t\}$ is an integrated process. This
nomenclature arises because we can view $Y_n = \sum_{t=1}^{n} \varepsilon_t$ as an "integral" of $\varepsilon_t$,
where $\{\varepsilon_t\}$ obeys the FCLT. Integrated processes are also commonly called
"unit root" processes. The "unit" root is that of the lag polynomial $B(L)$
ensuring the good behavior of $\varepsilon_t = B(L) Y_t$; see Hamilton (1994) for
background and further details. The good behavior of $\varepsilon_t$ in this context is often
heuristically specified to be stationarity. Nevertheless, the good behavior
relevant for us is precisely that $\{\varepsilon_t\}$ obeys the FCLT. Stationarity is neither
necessary nor sufficient for the FCLT. For example, if $\{\varepsilon_t\}$ is i.i.d., then
$\{\varepsilon_t - \varepsilon_{t-1}\}$ is stationary but does not obey the FCLT. (See Davidson, 1998,
for further discussion.)
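The counterexample can be checked in one line. Write $u_t \equiv \varepsilon_t - \varepsilon_{t-1}$ (a symbol introduced only for this remark), with $\{\varepsilon_t\}$ i.i.d. with variance $\sigma_\varepsilon^2$ and $\varepsilon_0$ drawn from the same distribution. The partial sums telescope, so that

```latex
\[
\sum_{t=1}^{m} u_t = \varepsilon_m - \varepsilon_0,
\qquad
\operatorname{var}\Bigl(n^{-1/2} \sum_{t=1}^{n} u_t\Bigr)
  = \frac{2\sigma_\varepsilon^2}{n} \to 0 .
\]
```

The limiting variance is therefore zero, and the requirement that $\sigma^2$ be nonzero fails even though $\{u_t\}$ is stationary: the rescaled partial sums collapse to zero instead of converging to a Wiener process.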

The unit root process is an extension of the random walk, as we do
not require that $\{\varepsilon_t\}$ be i.i.d. Instead, we just require that $\{\varepsilon_t\}$ obeys the
FCLT. Donsker's theorem establishes that $\{\varepsilon_t\}$ i.i.d. is sufficient for this,
but Theorems 7.15 through 7.19 show that this is not necessary.

In (i), we have assumed $Y_0 = 0$. A common alternate assumption is
that $Y_0$ is some random variable; but (i) implies instead that $Y_1$ is some
random variable (i.e., $\varepsilon_1$), so, after a reassignment of indexes, the situation
is identical either way. In applications, some statistics may be sensitive to
the initial value, especially if this is far from zero. A simple way to avoid this
sensitivity is to reset the process to zero by subtracting the initial value
from all observations and working with the shifted series $\tilde{Y}_t = Y_t - Y_0$,
$t = 0, 1, \ldots$, conditional on $Y_0$.

In conclusion (a), the integral $\int_0^1 W(a)^2 \, da$ appears. We interpret this as
that random variable, say $M$, defined by $M(\omega) \equiv \int_0^1 W(a,\omega)^2 \, da$, $\omega \in \Omega$.
For fixed $\omega$, this is just a standard (Riemann) integral, so no new integral
concept is involved. We now have a situation quite unlike that studied
previously. Before, $n^{-1} \sum_{t=1}^{n} X_t X_t' - M_n$ converged to zero, where $\{M_n\}$ is
not random. Here, $n^{-2} \sum_{t=1}^{n} X_t X_t' = n^{-2} \sum_{t=1}^{n} Y_{t-1}^2$ converges to a random
variable, $\sigma^2 M$.
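As a rough consistency check on (a), consider the special case, assumed only for this remark, in which $\{\varepsilon_t\}$ is i.i.d. with mean zero and variance $\sigma^2$, so that $E(Y_{t-1}^2) = (t-1)\sigma^2$. Using $E\,W(a)^2 = a$, which follows from Definition 7.5 (i) and (iii), the means of the two sides of (a) line up:

```latex
\[
E\Bigl(n^{-2} \sum_{t=1}^{n} Y_{t-1}^2\Bigr)
  = \sigma^2 n^{-2} \sum_{t=1}^{n} (t-1)
  = \sigma^2 \, \frac{n-1}{2n}
  \;\to\; \frac{\sigma^2}{2},
\]
\[
E\bigl(\sigma^2 M\bigr)
  = \sigma^2 \int_0^1 E\,W(a)^2 \, da
  = \sigma^2 \int_0^1 a \, da
  = \frac{\sigma^2}{2}.
\]
```

Agreement of the means is only a necessary check, not a proof of (a); the point of the theorem is that the whole distribution of the scaled sum converges to that of $\sigma^2 M$.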

In conclusion (b), something similar happens. We have $n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t =
n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t$ converging to a random variable, $(\sigma^2/2)\,(W(1)^2 - \tau^2/\sigma^2)$,
whereas in previous chapters we had $n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t$ converging
stochastically to zero. Note that $W(1)^2$ is $\chi_1^2$ (chi-squared with one degree of
freedom) so the expectation of this limiting random variable is

$$E\bigl((\sigma^2/2)\,(W(1)^2 - \tau^2/\sigma^2)\bigr) = (\sigma^2/2)\,(1 - \tau^2/\sigma^2),$$

from which we see that the expectation is not necessarily zero unless $\tau^2 =
\sigma^2$, as happens for the cases in which $\{\varepsilon_t\}$ is an independent or martingale
difference sequence.
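Conclusion (b) rests on an exact finite-sample identity. With $Y_0 = 0$ and $Y_t = Y_{t-1} + \varepsilon_t$, squaring $Y_t$ and summing gives $\sum_{t=1}^{n} Y_{t-1}\varepsilon_t = (Y_n^2 - \sum_{t=1}^{n} \varepsilon_t^2)/2$ for any sequence $\{\varepsilon_t\}$; since $Y_n^2/n = \sigma^2 W_n(1)^2$, the two terms on the right are the sources of the $W(1)^2$ and $\tau^2$ parts of the limit. The sketch below checks the identity numerically. The MA(1) errors are an assumed example of serial correlation, and the function name `both_sides` is hypothetical.

```python
import random


def both_sides(eps):
    """Return (n^-1 sum Y_{t-1} eps_t, Y_n^2/(2n) - sum eps_t^2/(2n)), with Y_0 = 0."""
    n = len(eps)
    y_prev = 0.0   # Y_{t-1}, starting from Y_0 = 0
    cross = 0.0    # running sum of Y_{t-1} * eps_t
    for e in eps:
        cross += y_prev * e
        y_prev += e  # after the loop, y_prev equals Y_n
    lhs = cross / n
    rhs = y_prev ** 2 / (2 * n) - sum(e * e for e in eps) / (2 * n)
    return lhs, rhs


rng = random.Random(0)
shocks = [rng.gauss(0.0, 1.0) for _ in range(501)]
# Serially correlated errors (an assumed MA(1) example): eps_t = v_t + 0.5 v_{t-1}.
eps = [shocks[t] + 0.5 * shocks[t - 1] for t in range(1, 501)]

lhs, rhs = both_sides(eps)
print(lhs, rhs)  # equal up to floating-point rounding
```

Because the identity is purely algebraic, it holds whether or not the errors are serially correlated; serial correlation matters only through the difference between $\tau^2$ and $\sigma^2$ in the limit.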

Together, (a) and (b) imply the asymptotic distribution results for the
least squares estimator, conclusion (c). Several things are noteworthy. First,
note that the scale factor here is $n$, not $\sqrt{n}$ as it previously has been. Thus,
$\hat{\beta}_n$ is "collapsing" to its limit at a much faster rate than before. This is
sometimes called superconsistency. Next, note that the limiting distribution
is no longer normal; instead, we have a distribution that is a somewhat
complicated function of a Wiener process. When $\tau^2 = \sigma^2$ (independent or
martingale difference case) we have the distribution of J. S. White (1958,
p. 1196), apart from an incorrect scaling there, as noted by Phillips (1987).
For $\tau^2 = \sigma^2$, this distribution is also that tabulated by Dickey and Fuller
(1979, 1981) in their famous work on testing for unit roots.
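A small Monte Carlo experiment makes both features visible. The sketch below assumes i.i.d. standard normal errors (so $\tau^2 = \sigma^2 = 1$); the sample sizes, replication counts, and function names are illustrative choices. In this case the sign of the limit in (c) is the sign of $W(1)^2 - 1$, which is negative with probability $P[|N(0,1)| < 1] \approx 0.68$, so roughly two thirds of the simulated values of $n(\hat{\beta}_n - 1)$ should be negative. Stability of the distribution of $n(\hat{\beta}_n - 1)$ across sample sizes is what scaling by $n$, and not $\sqrt{n}$, delivers.

```python
import random
import statistics


def scaled_ols_error(n, rng):
    """Simulate Y_t = Y_{t-1} + eps_t with Y_0 = 0 and return n * (beta_hat - 1)."""
    y_prev = 0.0
    sum_yy = 0.0  # sum of Y_{t-1}^2
    sum_ye = 0.0  # sum of Y_{t-1} * eps_t
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        sum_yy += y_prev * y_prev
        sum_ye += y_prev * e
        y_prev += e
    # beta_hat - 1 = sum(Y_{t-1} eps_t) / sum(Y_{t-1}^2)
    return n * sum_ye / sum_yy


def simulate(n, reps=1000, seed=0):
    rng = random.Random(seed)
    return [scaled_ols_error(n, rng) for _ in range(reps)]


draws_100 = simulate(100)
draws_400 = simulate(400, seed=1)

share_negative = sum(d < 0 for d in draws_400) / len(draws_400)
median_100 = statistics.median(draws_100)
median_400 = statistics.median(draws_400)

# The medians are similar across n, and well over half of the draws are negative.
print(share_negative, median_100, median_400)
```

Had the estimator converged at the usual rate, the distribution of $n(\hat{\beta}_n - 1)$ would spread out as $n$ grows; here it settles down to a skewed, nonnormal shape.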

In the regression setting studied in previous chapters the existence of
serial correlation in $\varepsilon_t$ in the presence of a lagged dependent variable
regressor leads to the inconsistency of $\hat{\beta}_n$ for $\beta_o$, as discussed in earlier chapters.
Here, however, the situation is quite different. Even though the regressor
is a lagged dependent variable, $\hat{\beta}_n$ is consistent for $\beta_o = 1$ (conclusion (d))
despite the fact that conditions 7.21 (ii) and (iii) permit $\{\varepsilon_t\}$ to display
considerable serial correlation.

The effect of the serial correlation is that $\tau^2 \ne \sigma^2$. This results in a shift
of the location of the asymptotic distribution away from zero, relative to
the $\tau^2 = \sigma^2$ case (no serial correlation). Despite this effect of the serial
correlation in $\{\varepsilon_t\}$, we no longer have the serious adverse consequence of
inconsistency of $\hat{\beta}_n$.

One way of understanding why this is so is succinctly expressed by
Phillips (1987, p. 283):

Intuitively, when the [data generating process] has a unit root,
the strength of the signal (as measured by the sample variation
of the regressor $Y_{t-1}$) dominates the noise by a factor of $O(n)$, so
that the effects of any regressor-error correlation are annihilated
in the regression as $n \to \infty$. [Notation changed to correspond.]


Note, however, that even when $\tau^2 = \sigma^2$ the asymptotic distribution given
in (c) is not centered about zero, so an asymptotic bias is still present. The
reason for this is that there generally exists a strong (negative) correlation
between $W(1)^2$ and $\left[\int_0^1 W(a)^2 \, da\right]^{-1}$, resulting from the fact that $W(1)^2$
and $W(a)^2$ are highly correlated for each $a$. Thus, even though $E(W(1)^2 -
\tau^2/\sigma^2) = 0$ with $\tau^2 = \sigma^2$ we do not have

$$E\Bigl(\Bigl[\int_0^1 W(a)^2 \, da\Bigr]^{-1} (1/2)\,(W(1)^2 - \tau^2/\sigma^2)\Bigr) = 0.$$

See Abadir (1995) for further details.

We are now ready to prove Theorem 7.21, using what is essentially the
proof of Phillips (1987, Theorem 3.1).

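Proof of Theorem 7.21. (a) First rewrite $n^{-2} \sum_{t=1}^{n} Y_{t-1}^2$ in terms of
$W_n(a_{t-1}) \equiv n^{-1/2} Y_{t-1}/\sigma = n^{-1/2} \sum_{s=1}^{t-1} \varepsilon_s/\sigma$, where $a_{t-1} = (t-1)/n$, so
that $n^{-2} \sum_{t=1}^{n} Y_{t-1}^2 = \sigma^2 n^{-1} \sum_{t=1}^{n} W_n(a_{t-1})^2$. Because $W_n(a)$ is constant
for $(t-1)/n \le a < t/n$, we have

$$n^{-1} \sum_{t=1}^{n} W_n(a_{t-1})^2 = \sum_{t=1}^{n} \int_{(t-1)/n}^{t/n} W_n(a)^2 \, da = \int_0^1 W_n(a)^2 \, da.$$

The continuous mapping theorem applies to $h(W_n) = \int_0^1 W_n(a)^2 \, da$. It
follows that $h(W_n) \Rightarrow h(W)$, so that $n^{-2} \sum_{t=1}^{n} Y_{t-1}^2 \Rightarrow \sigma^2 \int_0^1 W(a)^2 \, da$, as
claimed.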
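(b) Because $Y_{t-1} = \sigma n^{1/2} W_n(a_{t-1})$, we have

$$n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t = \sigma n^{-1/2} \sum_{t=1}^{n} W_n(a_{t-1}) \varepsilon_t.$$

Now $W_n(a_t) = W_n(a_{t-1}) + n^{-1/2} \varepsilon_t/\sigma$, so

$$W_n(a_t)^2 = W_n(a_{t-1})^2 + n^{-1} \varepsilon_t^2/\sigma^2 + 2 n^{-1/2} W_n(a_{t-1}) \varepsilon_t/\sigma.$$

Thus $\sigma n^{-1/2} W_n(a_{t-1}) \varepsilon_t = (1/2)\sigma^2 \bigl(W_n(a_t)^2 - W_n(a_{t-1})^2\bigr) - (1/2) n^{-1} \varepsilon_t^2$.
Substituting and summing, we have

$$\begin{aligned}
n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t
 &= (\sigma^2/2) \sum_{t=1}^{n} \bigl(W_n(a_t)^2 - W_n(a_{t-1})^2\bigr) - (1/2) n^{-1} \sum_{t=1}^{n} \varepsilon_t^2 \\
 &= (\sigma^2/2) W_n(1)^2 - (1/2) n^{-1} \sum_{t=1}^{n} \varepsilon_t^2.
\end{aligned}$$

By Lemma 4.27 and the FCLT (condition 7.21 (ii)), we have $W_n(1)^2 \Rightarrow
W(1)^2$; by condition 7.21 (iii) we have $n^{-1} \sum_{t=1}^{n} \varepsilon_t^2 \xrightarrow{p} \tau^2$. It follows by
Lemma 4.27 that

$$n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t \Rightarrow (\sigma^2/2) W(1)^2 - \tau^2/2 = (\sigma^2/2)\,(W(1)^2 - \tau^2/\sigma^2).$$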
(c) We have

$$\begin{aligned}
n(\hat{\beta}_n - 1)
 &= n \Bigl[ \Bigl(\sum_{t=1}^{n} Y_{t-1}^2\Bigr)^{-1} \sum_{t=1}^{n} Y_{t-1} Y_t - 1 \Bigr] \\
 &= n \Bigl(\sum_{t=1}^{n} Y_{t-1}^2\Bigr)^{-1} \sum_{t=1}^{n} Y_{t-1} (Y_t - Y_{t-1}) \\
 &= \Bigl(n^{-2} \sum_{t=1}^{n} Y_{t-1}^2\Bigr)^{-1} n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t.
\end{aligned}$$

Conclusions (a) and (b) and Lemma 4.27 then give the desired result:

$$n(\hat{\beta}_n - 1) = \Bigl(n^{-2} \sum_{t=1}^{n} Y_{t-1}^2\Bigr)^{-1} n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t
 \Rightarrow \Bigl[\int_0^1 W(a)^2 \, da\Bigr]^{-1} (1/2)\,(W(1)^2 - \tau^2/\sigma^2).$$

(d) From (c), we have $n(\hat{\beta}_n - 1) = O_p(1)$. Thus, $\hat{\beta}_n - 1 = n^{-1}(n(\hat{\beta}_n - 1)) =
o(1) O_p(1) = o_p(1)$, as claimed. ■

In the proof of (a), we appealed to the continuous mapping theorem
to establish that $\int_0^1 W_n(a)^2 \, da \Rightarrow \int_0^1 W(a)^2 \, da$. The next exercise asks you
to verify this and then gives a situation where the continuous mapping
theorem does not apply.


Exercise 7.22 For $U, V \in D$, put $d_u(U,V) = \sup_{a \in [0,1]} |U(a) - V(a)|$, and
consider the metrized measurable spaces $(D, d_u)$ and $(\mathbb{R}, |\cdot|)$, the mapping
$M_1(U) = U^2$, and the two functionals $M_2(U) = \int_0^1 U(a)^2 \, da$ and $M_3(U) =
\int_0^1 \log |U(a)| \, da$.

(i) Show that $M_1 : (D, d_u) \to (D, d_u)$ is continuous at $U$ for any $U \in C$,
but not everywhere in $D$.

(ii) Show that $M_2 : (D, d_u) \to (\mathbb{R}, |\cdot|)$ is continuous at $U$ for any $U \in C$,
so that $\int_0^1 W_n(a)^2 \, da \Rightarrow \int_0^1 W(a)^2 \, da$.

(iii) Show that $M_3 : (D, d_u) \to (\mathbb{R}, |\cdot|)$ is not continuous everywhere in
$C$, and $\int_0^1 \log(|W_n(a)|) \, da \not\Rightarrow \int_0^1 \log(|W(a)|) \, da$.


Phillips (1987, Theorem 3.1) states a version of Theorem 7.21 with spe- 
cific conditions corresponding to those of Theorem 7.18 (heterogeneous 
mixing FCLT) ensuring that conditions 7.21 (ii) and (iii) hold. Clearly, 
however, many alternate versions can be stated. The following exercise 
asks you to complete the details for Phillips’s result. 


Exercise 7.23 Use Theorem 7.18 and a corresponding law of large
numbers to state conditions on $\{\varepsilon_t\}$ ensuring that 7.21 (ii) and 7.21 (iii) hold,
with conclusions 7.21 (a) - (d) holding in consequence. What conditions
suffice for $\tau^2 = \sigma^2$?


Phillips also gives a limiting distribution for the standard t-statistic
centered appropriately for testing the hypothesis $\beta_o = 1$ (the "unit root
hypothesis"). Not surprisingly, this statistic no longer has the Student-t
distribution nor is it asymptotically normal; instead, the t-statistic limiting
distribution is a function of $W$, similar in form to that for $n(\hat{\beta}_n - 1)$.

Although the nonstandard distribution of the t-statistic makes it
awkward to use to test the unit root hypothesis ($H_o : \beta_o = 1$), a simple
rearrangement leads to a convenient $\chi_1^2$ statistic, first given by Phillips and
Durlauf (1986, Lemma 3.1 (d)). From 7.21 (b) we have that

$$\Bigl(n^{-1} \sum_{t=1}^{n} Y_{t-1}^2\Bigr) (\hat{\beta}_n - 1) = n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t
 \Rightarrow (\sigma^2/2)\,(W(1)^2 - \tau^2/\sigma^2).$$

Thus when $\beta_o = 1$

$$(2/\sigma^2) \Bigl(n^{-1} \sum_{t=1}^{n} Y_{t-1}^2\Bigr) (\hat{\beta}_n - 1) + \tau^2/\sigma^2 \Rightarrow W(1)^2 \sim \chi_1^2.$$

This statistic depends on the unknown quantities $\sigma^2$ and $\tau^2$, but estimators
$\hat{\sigma}_n^2$ and $\hat{\tau}_n^2$ consistent for $\sigma^2$ and $\tau^2$, respectively, under the unit root null
hypothesis can be straightforwardly constructed using $\varepsilon_t = Y_t - Y_{t-1}$. In
particular, set $\hat{\tau}_n^2 = n^{-1} \sum_{t=1}^{n} \varepsilon_t^2$ and form $\hat{\sigma}_n^2$ using Theorem 6.20. We
then have

$$(2/\hat{\sigma}_n^2) \Bigl(n^{-1} \sum_{t=1}^{n} Y_{t-1}^2\Bigr) (\hat{\beta}_n - 1) + \hat{\tau}_n^2/\hat{\sigma}_n^2 \Rightarrow \chi_1^2,$$

under the unit root null hypothesis, providing a simple unit root test
procedure.
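The statistic is simple to compute once $\hat{\sigma}_n^2$ is available. The sketch below is a minimal illustration under stated assumptions: the function name `phillips_durlauf_stat` is hypothetical, and the estimator of $\sigma^2$ is left to the caller. When none is supplied, the code falls back on $\hat{\sigma}_n^2 = \hat{\tau}_n^2$, which is appropriate only when the increments are serially uncorrelated (so that $\tau^2 = \sigma^2$); it is not the estimator of Theorem 6.20.

```python
import random
import statistics


def phillips_durlauf_stat(y, sigma2_hat=None):
    """Compute (2/s2) * (n^-1 sum Y_{t-1}^2) * (beta_hat - 1) + tau2/s2.

    `y` is the list [Y_0, Y_1, ..., Y_n]. If `sigma2_hat` is None, tau2_hat is
    used in its place, an assumption valid only for serially uncorrelated increments.
    """
    n = len(y) - 1
    eps = [y[t] - y[t - 1] for t in range(1, n + 1)]   # eps_t = Y_t - Y_{t-1}
    tau2_hat = sum(e * e for e in eps) / n
    if sigma2_hat is None:
        sigma2_hat = tau2_hat
    sum_yy = sum(y[t - 1] ** 2 for t in range(1, n + 1))
    beta_hat = sum(y[t - 1] * y[t] for t in range(1, n + 1)) / sum_yy
    return (2.0 / sigma2_hat) * (sum_yy / n) * (beta_hat - 1.0) + tau2_hat / sigma2_hat


def simulate_null_stats(n=200, reps=1000, seed=0):
    """Draw the statistic under the unit root null with i.i.d. N(0, 1) increments."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        y = [0.0]
        for _ in range(n):
            y.append(y[-1] + rng.gauss(0.0, 1.0))
        stats.append(phillips_durlauf_stat(y))
    return stats


null_stats = simulate_null_stats()
# A chi-squared(1) variable has mean 1, so the sample mean should be near 1.
print(statistics.mean(null_stats))
```

With serially correlated increments a long-run variance estimator should be passed as `sigma2_hat`; the fallback used here would then center the statistic incorrectly.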

Despite the convenience of the Phillips-Durlauf statistic, it turns out to
have disappointing power properties, even though it can be shown to rep- 
resent the locally best invariant test (see Tanaka, 1996, pp. 324-336). The 
difficulty is that its optimality is typically highly localized. In fact as Elliott, 
Rothenberg, and Stock (1996) note, there is no uniformly most powerful 
test in the present context, in sharp contrast to the typical situation of 
earlier chapters, where we could rely on asymptotic normality. Here, the 
limiting distribution is nonnormal. In consequence, there is a plethora of 
plausible unit root tests in the literature, and our discussion in this section 
has done no more than scratch the surface. Nevertheless, the basic results 
just given should assist the reader interested in exploring this literature. 
An overview of the literature is given by Phillips and Xiao (1998); of par- 
ticular interest are the articles by Dickey and Fuller (1979, 1981), Elliott 
et al. (1996), Johansen (1988, 1991), Phillips (1987), Phillips and Durlauf 
(1986), and Stock (1994, 1999). 


7.5 Spurious Regression, Multivariate Wiener
Processes, and Multivariate FCLTs


Now consider what happens when we regress a unit root process $Y_t = Y_{t-1} +
\varepsilon_t$ not on its own lagged value $Y_{t-1}$, but on another unit root process, say
$X_t = X_{t-1} + \eta_t$. For simplicity, assume that $\{\eta_t\}$ is i.i.d. and independent
of the i.i.d. sequence $\{\varepsilon_t\}$ so that $\{Y_t\}$ and $\{X_t\}$ are independent random
walks, and as before we set $X_0 = Y_0 = 0$. We can write a regression equation
for $Y_t$ in terms of $X_t$ formally as

$$Y_t = X_t \beta_o + u_t,$$

where $\beta_o = 0$ and $u_t = Y_t$, reflecting the lack of any relation between $Y_t$
and $X_t$.
We then ask how the ordinary least squares estimator

$$\hat{\beta}_n = \Bigl(\sum_{t=1}^{n} X_t^2\Bigr)^{-1} \sum_{t=1}^{n} X_t Y_t$$

behaves as $n$ becomes large. As we will see, $\hat{\beta}_n$ is not consistent for $\beta_o =
0$ but instead converges to a particular random variable. Because there
is truly no relation between $Y_t$ and $X_t$, and because $\hat{\beta}_n$ is incapable of
revealing this, we call this a case of "spurious regression." This situation
was first considered by Yule (1926), and the dangers of spurious regression
were forcefully brought to the attention of economists by the Monte Carlo
studies of Granger and Newbold (1974).
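To proceed, we write

$$W_{1n}(a_t) = n^{-1/2} X_t/\sigma_1 = n^{-1/2} \sum_{s=1}^{t} \eta_s/\sigma_1,$$

$$W_{2n}(a_t) = n^{-1/2} Y_t/\sigma_2 = n^{-1/2} \sum_{s=1}^{t} \varepsilon_s/\sigma_2,$$

where $\sigma_1^2 \equiv \lim_{n \to \infty} \operatorname{var}(n^{-1/2} \sum_{t=1}^{n} \eta_t)$ and $\sigma_2^2 \equiv \lim_{n \to \infty} \operatorname{var}(n^{-1/2} \sum_{t=1}^{n}
\varepsilon_t)$, and $a_t = t/n$ as before. Substituting for $X_t$ and $Y_t$ and, for convenience,
treating $\hat{\beta}_{n-1}$ instead of $\hat{\beta}_n$ we can write

$$\begin{aligned}
\hat{\beta}_{n-1}
 &= \Bigl(\sum_{t=1}^{n} X_{t-1}^2\Bigr)^{-1} \sum_{t=1}^{n} X_{t-1} Y_{t-1} \\
 &= \Bigl(\sigma_1^2 \, n \sum_{t=1}^{n} W_{1n}(a_{t-1})^2\Bigr)^{-1} \sigma_1 \sigma_2 \, n \sum_{t=1}^{n} W_{1n}(a_{t-1}) W_{2n}(a_{t-1}) \\
 &= (\sigma_2/\sigma_1) \Bigl[n^{-1} \sum_{t=1}^{n} W_{1n}(a_{t-1})^2\Bigr]^{-1} n^{-1} \sum_{t=1}^{n} W_{1n}(a_{t-1}) W_{2n}(a_{t-1}).
\end{aligned}$$

From the proof of Theorem 7.21 (a), we have that

$$n^{-1} \sum_{t=1}^{n} W_{1n}(a_{t-1})^2 \Rightarrow \int_0^1 W_1(a)^2 \, da,$$

where $W_1$ is a Wiener process. We also have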


$$n^{-1} \sum_{t=1}^{n} W_{1n}(a_{t-1}) W_{2n}(a_{t-1}) = \sum_{t=1}^{n} \int_{(t-1)/n}^{t/n} W_{1n}(a) W_{2n}(a) \, da,$$

because $W_{1n}(a) W_{2n}(a)$ is constant for $(t-1)/n \le a < t/n$. Thus

$$n^{-1} \sum_{t=1}^{n} W_{1n}(a_{t-1}) W_{2n}(a_{t-1}) = \int_0^1 W_{1n}(a) W_{2n}(a) \, da.$$

We might expect this to converge in distribution to $\int_0^1 W_1(a) W_2(a) \, da$,
where $W_1$ and $W_2$ are independent Wiener processes. (Why?) If so, then
we have

$$\hat{\beta}_n \Rightarrow (\sigma_2/\sigma_1) \Bigl[\int_0^1 W_1(a)^2 \, da\Bigr]^{-1} \int_0^1 W_1(a) W_2(a) \, da,$$

which is a nondegenerate random variable. $\hat{\beta}_n$ is then not consistent for
$\beta_o = 0$, so the regression is "spurious."
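The failure of consistency is easy to see in a simulation in the spirit of the Monte Carlo studies cited above. The sketch below assumes two independent Gaussian random walks with unit-variance steps; the sample sizes, replication count, and function names are illustrative choices. If $\hat{\beta}_n$ were consistent for $\beta_o = 0$, its dispersion across replications would shrink as $n$ grows; with a nondegenerate limit it should stay roughly the same.

```python
import random
import statistics


def spurious_slope(n, rng):
    """Regress one random walk on an independent one (no intercept); return the slope."""
    x = 0.0
    y = 0.0
    sum_xx = 0.0
    sum_xy = 0.0
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)  # X_t = X_{t-1} + eta_t
        y += rng.gauss(0.0, 1.0)  # Y_t = Y_{t-1} + eps_t, independent of X
        sum_xx += x * x
        sum_xy += x * y
    return sum_xy / sum_xx


def slope_spread(n, reps=300, seed=0):
    """Standard deviation of the OLS slope across `reps` independent samples of size n."""
    rng = random.Random(seed)
    return statistics.stdev([spurious_slope(n, rng) for _ in range(reps)])


spread_small = slope_spread(100)
spread_large = slope_spread(1600, seed=1)

# A sixteenfold increase in n leaves the spread of the slope essentially unchanged.
print(spread_small, spread_large)
```

For comparison, with stationary regressors and a $\sqrt{n}$ rate, a sixteenfold increase in the sample size would cut the standard deviation by a factor of about four.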

Nevertheless, we do not yet have all the tools needed to draw this
conclusion formally. In particular, we need to extend the notion of a
univariate Wiener process to that of a multivariate Wiener process, and for
this we need to extend the spaces of functions we consider from $C[0,\infty)$
or $D[0,\infty)$ to Cartesian product spaces $C^k[0,\infty) \equiv \times_{j=1}^{k} C[0,\infty)$ and
$D^k[0,\infty) \equiv \times_{j=1}^{k} D[0,\infty)$.


Definition 7.24 (Multivariate Wiener process) $W = (W_1, \ldots, W_k)'$
is a multivariate ($k$-dimensional) Wiener process if $W_1, \ldots, W_k$ are
independent ($\mathbb{R}$-valued) Wiener processes.


The multivariate Wiener process exists (e.g., Breiman, 1992, Ch. 12), has
independent increments, and has increments $W(a,\cdot) - W(b,\cdot)$ distributed
as multivariate normal $N(0, (a - b)I)$, $0 \le b < a < \infty$, as is
straightforwardly shown. Further, there is a version such that for all $\omega \in \Omega$,
$W(0,\omega) = 0$ and $W(\cdot,\omega) : [0,\infty) \to \mathbb{R}^k$ is continuous, a consequence
of Proposition 7.7 and the fact that $W$ is continuous if and only if its
components are continuous. Thus, there is a version of the multivariate
Wiener process taking values in $C^k[0,\infty)$, so that $W$ is a random element
of $C^k[0,\infty)$. As in the univariate case, when we speak of a multivariate
Wiener process, we shall have in mind one with continuous sample paths.
When dealing with multivariate FCLTs it will suffice here to restrict
attention to functions defined on $[0,1]$. We thus write $C^k \equiv \times_{j=1}^{k} C[0,1]$ and
$D^k \equiv \times_{j=1}^{k} D[0,1]$.

Analogous to the univariate case, we can define a multivariate random 
walk as follows. 


Definition 7.25 (Multivariate random walk) Let $X_t = X_{t-1} + Z_t$,
$t = 1, 2, \ldots$, where $X_0 = 0$ ($k \times 1$) and $\{Z_t\}$ is a
sequence of i.i.d. $k \times 1$ vectors $Z_t = (Z_{t1}, \ldots, Z_{tk})'$
such that $E(Z_t) = 0$ and $E(Z_t Z_t') = \Sigma$, a finite positive
definite matrix. Then $\{X_t\}$ is a multivariate ($k$-dimensional)
random walk.


We form the rescaled partial sums as

$$W_n(a) = \Sigma^{-1/2} n^{-1/2} \sum_{t=1}^{[an]} Z_t.$$

The components of $W_n$ are the individual partial sums

$$W_{nj}(a) = n^{-1/2} \sum_{t=1}^{[an]} \tilde{Z}_{tj}, \qquad j = 1, \ldots, k,$$

where $\tilde{Z}_{tj}$ is the $j$th element of $\Sigma^{-1/2} Z_t$. For
given $\omega$, the components $W_{nj}(\cdot, \omega)$ are piecewise
constant, so $W_n$ is a random element of $D^k$.
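
To make the construction concrete, here is a small sketch of how the rescaled partial sums could be computed for a given sample. The function names (`inv_sqrt_psd`, `rescaled_partial_sums`, `step_value`) are illustrative only, and the symmetric inverse square root is one of several valid choices for $\Sigma^{-1/2}$.

```python
import numpy as np

def inv_sqrt_psd(sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    eigval, eigvec = np.linalg.eigh(sigma)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

def rescaled_partial_sums(z, sigma):
    """Return W_n evaluated at a = 0, 1/n, ..., 1 as an (n + 1, k) array.

    z     : (n, k) array whose rows are the increments Z_t
    sigma : (k, k) covariance matrix E(Z_t Z_t')
    """
    n = z.shape[0]
    z_tilde = z @ inv_sqrt_psd(sigma).T            # rows are Sigma^{-1/2} Z_t
    w = np.cumsum(z_tilde, axis=0) / np.sqrt(n)    # n^{-1/2} times partial sums
    return np.vstack([np.zeros(z.shape[1]), w])    # prepend W_n(0) = 0

def step_value(w_grid, a):
    """Value of the piecewise constant path at a in [0, 1]: W_n([an]/n)."""
    n = w_grid.shape[0] - 1
    return w_grid[int(np.floor(a * n))]
```

Each column of the returned array is one component $W_{nj}$ on the grid $a_t = t/n$; `step_value` recovers the value between grid points, reflecting the fact that each path is constant on $[(t-1)/n, t/n)$.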

We expect that a multivariate version of Donsker’s theorem should hold,
analogous to the multivariate CLT, so that $W_n \Rightarrow W$. To establish
this formally, we study the weak convergence of the measures $\mu_n$,
defined by

$$\mu_n(A) = P\{\omega : W_n(\cdot, \omega) \in A\},$$

where now $A \in \mathcal{B}(D^k, d)$ for a suitable choice of metric $d$ on
$D^k$. For example, we can choose $d(x, y) = \sum_{j=1}^{k} d_B(x_j, y_j)$
for $x = (x_1, \ldots, x_k)'$, $y = (y_1, \ldots, y_k)'$, $x_j, y_j \in D$,
with $d_B$ the Billingsley metric on $D$.

The multivariate FCLT provides conditions under which $\mu_n$ converges to
the multivariate Wiener measure $\mu_W$ defined by

$$\mu_W(A) = P\{\omega : W(\cdot, \omega) \in A \cap C^k\},$$

where $A \in \mathcal{B}(D^k, d)$. When $\mu_n \Rightarrow \mu_W$ we say
that $W_n \Rightarrow W$ and that $W_n$ obeys the multivariate FCLT, or that
$\{Z_t\}$ obeys the multivariate FCLT.

To establish the multivariate FCLT it is often convenient to apply an
analog of the Cramér-Wold device (e.g., Wooldridge and White, 1988).


Proposition 7.26 (Functional Cramér-Wold device) Let $\{V_n\}$ be a
sequence of random elements of $D^k$ and let $V$ be a random element of
$D^k$ (not necessarily the Wiener process). Then $V_n \Rightarrow V$ if and
only if $\lambda' V_n \Rightarrow \lambda' V$ for all
$\lambda \in \mathbb{R}^k$, $\lambda'\lambda = 1$.


Proof. See Wooldridge and White (1988), proof of Proposition 4.1. ■


Applying the functional Cramér-Wold device makes it straightforward to 
establish multivariate versions of Theorems 7.13 and 7.15 through 7.19. To 
complete our discussion of spurious regression, we state the multivariate 
Donsker Theorem. 


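Theorem 7.27 (Multivariate Donsker) Let $\{Z_t\}$ be a sequence of i.i.d.
$k \times 1$ vectors $Z_t = (Z_{t1}, \ldots, Z_{tk})'$ such that
$E(Z_t) = 0$ and $E(Z_t Z_t') = \Sigma$, a finite nonsingular matrix. Then
$W_n \Rightarrow W$.


Proof. Fix $\lambda \in \mathbb{R}^k$, $\lambda'\lambda = 1$. Then
$\lambda' W_n(a) = n^{-1/2} \sum_{t=1}^{[an]} \lambda' \Sigma^{-1/2} Z_t$,
where $\{\lambda' \Sigma^{-1/2} Z_t\}$ is i.i.d. with
$E(\lambda' \Sigma^{-1/2} Z_t) = 0$ and

$$\begin{aligned}
\mathrm{var}(\lambda' \Sigma^{-1/2} Z_t)
  &= \lambda' \Sigma^{-1/2} E(Z_t Z_t') \Sigma^{-1/2} \lambda \\
  &= \lambda' \Sigma^{-1/2} \Sigma \Sigma^{-1/2} \lambda = \lambda'\lambda = 1.
\end{aligned}$$

The conditions of Donsker’s Theorem 7.13 hold, so
$\lambda' W_n \Rightarrow \mathcal{W}$, the univariate Wiener process. Now
$\mathcal{W} = \lambda' W$ and this holds for all
$\lambda \in \mathbb{R}^k$, $\lambda'\lambda = 1$. The result then follows
by the functional Cramér-Wold device. ■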


At last we have what we need to state a formal result for spurious re- 
gression. We leave the proof as an exercise. 


Exercise 7.28 (Spurious regression) Let $\{X_t\}$ and $\{Y_t\}$ be
independent random walks, $X_t = X_{t-1} + \eta_t$ and
$Y_t = Y_{t-1} + \varepsilon_t$, where $\sigma_1^2 = E(\eta_t^2)$ and
$\sigma_2^2 = E(\varepsilon_t^2)$. Then

$$\hat\beta_n \Rightarrow (\sigma_2/\sigma_1) \left[ \int_0^1 W_1(a)^2 \, da \right]^{-1} \int_0^1 W_1(a) W_2(a) \, da,$$

where $W = (W_1, W_2)'$ is a bivariate Wiener process. (Hint: apply the
multivariate Donsker Theorem and the continuous mapping theorem with
$S = D^2$.)


Not only does $\hat\beta_n$ not converge to zero (the true value of
$\beta_o$) in this case, but it can be shown that the usual t-statistic
tends to $\infty$, giving the misleading impression that $\hat\beta_n$ is
highly statistically significant. See Watson (1994, p. 2863) for a nice
discussion and Phillips (1986) for further details.
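
The divergence of the t-statistic can also be seen in a small Monte Carlo experiment. The sketch below is illustrative only: it assumes standard normal shocks, a regression without intercept, the conventional residual variance estimator with $n - 1$ degrees of freedom, and hypothetical function names.

```python
import numpy as np

def spurious_t_stats(n, reps, seed=0):
    """Usual t-statistics for H0: beta = 0 when Y_t is regressed (no intercept)
    on an independent random walk X_t."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.standard_normal((reps, n)), axis=1)
    y = np.cumsum(rng.standard_normal((reps, n)), axis=1)
    sxx = np.sum(x * x, axis=1)
    beta_hat = np.sum(x * y, axis=1) / sxx
    resid = y - beta_hat[:, None] * x
    s2 = np.sum(resid ** 2, axis=1) / (n - 1)      # residual variance estimate
    return beta_hat / np.sqrt(s2 / sxx)

def rejection_summary(sizes=(100, 1600), reps=400):
    """Median |t| and share of |t| > 1.96 for each sample size."""
    out = {}
    for n in sizes:
        t = np.abs(spurious_t_stats(n, reps, seed=n))
        out[n] = (float(np.median(t)), float(np.mean(t > 1.96)))
    return out
```

In runs of `rejection_summary()` one should expect the median of $|t|$ to grow with $n$ and the nominal 5% test to reject far more often than 5% of the time, even though the two series are independent by construction.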

For each univariate FCLT, there is an analogous multivariate FCLT. This 
follows from the following result. 


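Theorem 7.29 (Multivariate FCLT) Suppose that $\{Z_t\}$ is globally
covariance stationary with mean zero and nonsingular global covariance
matrix $\Sigma$ such that for each $\lambda \in \mathbb{R}^k$,
$\lambda'\lambda = 1$, $\{\lambda' \Sigma^{-1/2} Z_t\}$ obeys the FCLT.
Then $W_n \Rightarrow W$.


Proof. Fix $\lambda \in \mathbb{R}^k$, $\lambda'\lambda = 1$. Then
$\lambda' W_n(a) = n^{-1/2} \sum_{t=1}^{[an]} \lambda' \Sigma^{-1/2} Z_t$.
Now $E(\lambda' \Sigma^{-1/2} Z_t) = 0$ and
$\mathrm{var}(\lambda' W_n(1)) \to 1$ as $n \to \infty$. Because
$\{\lambda' \Sigma^{-1/2} Z_t\}$ satisfies the FCLT, we have
$\lambda' W_n \Rightarrow \mathcal{W} = \lambda' W$. This follows for all
$\lambda \in \mathbb{R}^k$, $\lambda'\lambda = 1$, so the conclusion follows
by the functional Cramér-Wold device. ■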


Quite general multivariate FCLTs are available. See Wooldridge and 
White (1988) and Davidson (1994, Ch. 29) for some further results and dis- 
cussion. For example, here is a useful multivariate FCLT for heterogeneous 
mixing processes, which follows as an immediate corollary to Wooldridge 
and White (1988, Corollary 4.2). 


Theorem 7.30 (Heterogeneous mixing multivariate FCLT) Let $\{Z_{nt}\}$
be a double array of $k \times 1$ random vectors
$Z_{nt} = (Z_{nt1}, \ldots, Z_{ntk})'$ such that $\{Z_{nt}\}$ is mixing with
$\phi$ of size $-r/(2r - 2)$ or $\alpha$ of size $-r/(r - 2)$, $r > 2$.
Suppose further that $E|Z_{ntj}|^r < \Delta < \infty$, $E(Z_{ntj}) = 0$,
$n, t = 1, 2, \ldots$, $j = 1, \ldots, k$. If $\{Z_{nt}\}$ is globally
covariance stationary with nonsingular global covariance matrix
$\Sigma = \lim_{n \to \infty} \mathrm{var}(n^{-1/2} \sum_{t=1}^{n} Z_{nt})$,
then $W_n \Rightarrow W$.


Proof. Under the conditions given, $\lambda' \Sigma^{-1/2} Z_{nt}$ obeys the
heterogeneous mixing FCLT, Theorem 7.18. As $Z_{nt}$ is globally covariance
stationary, the result follows from Theorem 7.29. See also Wooldridge and
White (1988, Corollary 4.2). ■


Our next exercise is an application of this result. 


Exercise 7.31 Determine the behavior of the least squares estimator
$\hat\beta_n$ when the unit root process $Y_t = Y_{t-1} + \varepsilon_t$ is
regressed on the unit root process $X_t = X_{t-1} + \eta_t$, where
$\{\eta_t\}$ is independent of $\{\varepsilon_t\}$, and
$\{\eta_t, \varepsilon_t\}$ satisfies the conditions of Theorem 7.30.

Continuing with the spurious regression setup in which $\{X_t\}$ and
$\{Y_t\}$ are independent random walks, consider what happens if we take a
nontrivial linear combination of $X_t$ and $Y_t$:

$$a_1 Y_t + a_2 X_t = a_1 Y_{t-1} + a_2 X_{t-1} + a_1 \varepsilon_t + a_2 \eta_t,$$

where $a_1$ and $a_2$ are not both zero. We can write this as

$$Z_t = Z_{t-1} + v_t,$$

where $Z_t = a_1 Y_t + a_2 X_t$ and $v_t = a_1 \varepsilon_t + a_2 \eta_t$.
Thus, $Z_t$ is again a random walk process, as $\{v_t\}$ is i.i.d. with mean
zero and finite variance, given that $\{\varepsilon_t\}$ and $\{\eta_t\}$
each are i.i.d. with mean zero and finite variance. No matter what
coefficients $a_1$ and $a_2$ we choose, the resulting linear combination is
again a random walk, hence an integrated or unit root process.

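Now consider what happens when $\{X_t\}$ is a random walk (hence
integrated) as before, but $\{Y_t\}$ is instead generated according to
$Y_t = X_t \beta_o + \varepsilon_t$, with $\{\varepsilon_t\}$ again i.i.d.
By itself, $\{Y_t\}$ is an integrated process, because

$$Y_t - Y_{t-1} = (X_t - X_{t-1}) \beta_o + \varepsilon_t - \varepsilon_{t-1},$$

so that

$$\begin{aligned}
Y_t &= Y_{t-1} + \eta_t \beta_o + \varepsilon_t - \varepsilon_{t-1} \\
    &= Y_{t-1} + \zeta_t,
\end{aligned}$$

where $\zeta_t = \eta_t \beta_o + \varepsilon_t - \varepsilon_{t-1}$ is
readily verified to obey the FCLT.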

Despite the fact that both $\{X_t\}$ and $\{Y_t\}$ are integrated processes,
the situation is very different from that considered at the outset of this
section. Here, there is indeed a linear combination of $X_t$ and $Y_t$ that
is not an integrated process: putting $a_1 = 1$ and $a_2 = -\beta_o$, we
have

$$a_1 Y_t + a_2 X_t = Y_t - \beta_o X_t = \varepsilon_t,$$

which is i.i.d. This is an example of a pair $\{X_t, Y_t\}$ of cointegrated
processes.


Definition 7.32 (Cointegrated processes) Let
$X_t = (X_{t1}, \ldots, X_{tk})'$ be a vector of integrated processes. If
there exists a $k \times 1$ vector $a$ such that the linear combination
$\{Z_t = a'X_t\}$ obeys the FCLT, then $X_t$ is a vector of cointegrated
processes.
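
A short simulation can illustrate the contrast between the integrated levels and the cointegrating combination. This is only a sketch under assumed Gaussian shocks with unit variances and an arbitrary value $\beta_o = 2$; the function names are hypothetical.

```python
import numpy as np

def simulate_cointegrated_pair(n, beta_o=2.0, reps=1000, seed=0):
    """X_t is a random walk and Y_t = X_t * beta_o + eps_t with i.i.d. eps_t."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.standard_normal((reps, n)), axis=1)
    y = beta_o * x + rng.standard_normal((reps, n))
    return x, y

def terminal_variances(n, beta_o=2.0):
    """Cross-replication variances of X_n, Y_n, and Y_n - beta_o * X_n."""
    x, y = simulate_cointegrated_pair(n, beta_o)
    z = y - beta_o * x                    # combination with a = (1, -beta_o)'
    return (float(np.var(x[:, -1])),
            float(np.var(y[:, -1])),
            float(np.var(z[:, -1])))
```

For this design the variances of $X_n$ and $Y_n$ grow in proportion to $n$, while the variance of $Y_n - \beta_o X_n$ stays near that of $\varepsilon_t$, which is what distinguishes this pair from the two independent random walks considered earlier.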

Cointegrated processes were introduced by Granger (1981). This paper
and that of Engle and Granger (1987) have had a major impact on modern 
econometrics, and there is now a voluminous literature on the theory and 
application of cointegration. Excellent treatments can be found in Johansen 
(1988, 1991, 1996), Phillips (1991), Sims, Stock, and Watson (1990) and 
Stock and Watson (1993). 

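Our purpose here is not to pursue the various aspects of the theory of
cointegration. Instead, we restrict our attention to developing the tools
necessary to analyze the behavior of the least squares estimator
$\hat\beta_n = (\sum_{t=1}^{n} X_t X_t')^{-1} \sum_{t=1}^{n} X_t Y_t$, when
$(X_t', Y_t)$ is a vector of cointegrated processes.

For the case of scalar $X_t$,

$$\begin{aligned}
\hat\beta_n
  &= \Big( \sum_{t=1}^{n} X_t^2 \Big)^{-1} \sum_{t=1}^{n} X_t Y_t \\
  &= \Big( \sum_{t=1}^{n} X_t^2 \Big)^{-1} \sum_{t=1}^{n} X_t (X_t \beta_o + \varepsilon_t) \\
  &= \beta_o + \Big( \sum_{t=1}^{n} X_t^2 \Big)^{-1} \sum_{t=1}^{n} X_t \varepsilon_t.
\end{aligned}$$

For now, we maintain the assumption that $\{X_t\}$ is a random walk and that
$\{\varepsilon_t\}$ is i.i.d., independent of $\{\eta_t\}$ (which underlies
$\{X_t\}$). We later relax these requirements.

From the expression above, we see that the behavior of $\hat\beta_n$ is
determined by that of $\sum_{t=1}^{n} X_t^2$ and
$\sum_{t=1}^{n} X_t \varepsilon_t$, so we consider each of these in turn.
First consider $\sum_{t=1}^{n} X_t^2$. From Theorem 7.21 (a) we know that

$$n^{-2} \sum_{t=1}^{n} X_t^2 \Rightarrow \sigma_1^2 \int_0^1 W_1(a)^2 \, da,$$

where $\sigma_1^2 = E(\eta_t^2)$.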
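Next, consider $\sum_{t=1}^{n} X_t \varepsilon_t$. Substituting
$X_t = X_{t-1} + \eta_t$ we have

$$n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t = n^{-1} \sum_{t=1}^{n} X_{t-1} \varepsilon_t + n^{-1} \sum_{t=1}^{n} \eta_t \varepsilon_t.$$

Under our assumptions, $\{\eta_t \varepsilon_t\}$ obeys a law of large
numbers, so the last term converges to $E(\eta_t \varepsilon_t) = 0$. We
thus focus on $n^{-1} \sum_{t=1}^{n} X_{t-1} \varepsilon_t$. If we write

$$W_{1n}(a_t) = n^{-1/2} \sum_{s=1}^{t} \eta_s / \sigma_1,$$

$$W_{2n}(a_t) = n^{-1/2} \sum_{s=1}^{t} \varepsilon_s / \sigma_2,$$

where, as before, we set $a_t = t/n$, we have
$X_{t-1} = n^{1/2} \sigma_1 W_{1n}(a_{t-1})$ and
$\varepsilon_t = n^{1/2} \sigma_2 (W_{2n}(a_t) - W_{2n}(a_{t-1}))$.
Substituting these expressions we obtain

$$\begin{aligned}
n^{-1} \sum_{t=1}^{n} X_{t-1} \varepsilon_t
  &= n^{-1} \sum_{t=1}^{n} n^{1/2} \sigma_1 W_{1n}(a_{t-1})
     \times n^{1/2} \sigma_2 \big( W_{2n}(a_t) - W_{2n}(a_{t-1}) \big) \\
  &= \sigma_1 \sigma_2 \sum_{t=1}^{n} W_{1n}(a_{t-1}) \big( W_{2n}(a_t) - W_{2n}(a_{t-1}) \big).
\end{aligned}$$
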

The summation appearing in the expression immediately above has a pro- 
totypical form that plays a central role not only in the estimation of the 
parameters of cointegration models, but in a range of other contexts as 
well. 
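
The substitution that produces this prototypical sum is purely algebraic, so it can be verified exactly on any simulated sample. The following sketch does so; the helper names and the particular values of $\sigma_1$ and $\sigma_2$ are arbitrary choices made for illustration.

```python
import numpy as np

def discrete_stochastic_integral(w1, w2):
    """Sum over t of W1n(a_{t-1}) * (W2n(a_t) - W2n(a_{t-1})).

    w1, w2 : arrays of length n + 1 holding the paths at a_t = t/n, t = 0..n.
    """
    return float(np.sum(w1[:-1] * np.diff(w2)))

def check_identity(n=250, sigma1=1.5, sigma2=0.7, seed=0):
    """Return both sides of n^{-1} sum X_{t-1} eps_t = sigma1*sigma2 * (sum above)."""
    rng = np.random.default_rng(seed)
    eta = sigma1 * rng.standard_normal(n)
    eps = sigma2 * rng.standard_normal(n)
    x_lag = np.concatenate([[0.0], np.cumsum(eta)[:-1]])   # X_{t-1}, with X_0 = 0
    lhs = np.sum(x_lag * eps) / n
    w1 = np.concatenate([[0.0], np.cumsum(eta)]) / (sigma1 * np.sqrt(n))
    w2 = np.concatenate([[0.0], np.cumsum(eps)]) / (sigma2 * np.sqrt(n))
    rhs = sigma1 * sigma2 * discrete_stochastic_integral(w1, w2)
    return float(lhs), float(rhs)
```

The two returned numbers agree up to floating point rounding for every draw, since no limit argument is involved at this stage.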

The analysis of such expressions is facilitated by defining a stochastic
integral as

$$\int_0^1 W_{1n} \, dW_{2n} \equiv \sum_{t=1}^{n} W_{1n}(a_{t-1}) \big( W_{2n}(a_t) - W_{2n}(a_{t-1}) \big).$$

Writing the expression in this way and noting that
$W_n = (W_{1n}, W_{2n})' \Rightarrow W = (W_1, W_2)'$ suggest that we might
expect that

$$\int_0^1 W_{1n} \, dW_{2n} \Rightarrow \int_0^1 W_1 \, dW_2,$$

where the integral on the right, involving the stochastic differential
$dW_2$, will have to be given a suitable meaning, as this expression does
not correspond to any of the standard integrals familiar from elementary
calculus.

Under certain conditions this convergence does hold. Chan and Wei
(1988) were the first to establish a result of this sort. Nevertheless, this
result is not generally valid. Instead, as we describe in more detail below,
a recentering may be needed to accommodate dependence in the increments
of $W_{1n}$ or $W_{2n}$ not present in the increments of $W_1$ and $W_2$. As
it turns out, the generic result is that

$$\int_0^1 W_{1n} \, dW_{2n} - \Lambda_n \Rightarrow \int_0^1 W_1 \, dW_2,$$

where $\Lambda_n$ performs the recentering.

We proceed by first making sense of the integral on the right, using 
the phenomenally successful theory of stochastic integration developed by 
Ito in the 1940’s; see Ito (1944). We then discuss the issues involved in 
establishing the limiting behavior of $\int_0^1 W_{1n} \, dW_{2n}$.

To ensure that the stochastic integrals defined below are properly be- 
haved, we make use of the notion of the filtration generated by W. 


Definition 7.33 (Filtration generated by W) Let $W$ be a standard
multivariate Wiener process. The filtration generated by $W$ is the sequence
of $\sigma$-fields $\{\mathcal{F}_t, t \in [0, \infty)\}$, where
$\mathcal{F}_t = \sigma(W(a), 0 \le a \le t)$.


Note that $\{\mathcal{F}_t\}$ is an increasing sequence of $\sigma$-fields
($\mathcal{F}_a \subset \mathcal{F}_t$, $a < t$). In what follows, the
$\sigma$-field $\mathcal{F}_t$ will always denote that just defined.

We can now define the random step functions that provide the foundation 
for Ito’s stochastic integral, parallel to the construction of the familiar 
Riemann integral. 


Definition 7.34 (Random step function) Suppose there is a finite
sequence of real numbers $0 = a_0 < a_1 < \cdots < a_n$ and a sequence of
random variables $\eta_t$, with $E(\eta_t^2) < \infty$, where $\eta_t$ is
adapted to $\mathcal{F}_{a_t}$, $t = 0, \ldots, n - 1$. Let
$f : [0, \infty) \times \Omega \to \mathbb{R}$ be defined so that
$f(a, \cdot) = \eta_t$ for $a_t \le a < a_{t+1}$. Then $f$ is a random step
function.


The Ito stochastic integral is straightforward to define for random step 
functions. 


Definition 7.35 Let $\mathcal{W}$ be a component of $W$. The Ito stochastic
integral of a random step function $f$ is defined as

$$I(f) = \sum_{t=1}^{n} \eta_{t-1} \big( \mathcal{W}(a_t) - \mathcal{W}(a_{t-1}) \big).$$

We also write $I(f) = \int_0^\infty f \, d\mathcal{W}$ or
$I(f) = \int_0^\infty f(a) \, d\mathcal{W}(a)$ in this case.
The Ito stochastic integral of a random step function has finite second
moment:


Proposition 7.36 If $f$ is a random step function, then

$$E(I(f)^2) = E\left( \int_0^\infty |f(a)|^2 \, da \right) < \infty.$$


Proof. See Brzeźniak and Zastawniak (1999, pp. 182-183). ■


To extend the definition of the Ito stochastic integral to a class of func- 
tions wider than the random step functions, we use the random step func- 
tions as approximations, analogous to the construction of the Riemann 
integral. Specifically, we define the Ito stochastic integral for functions well 
approximated by random step functions in mean square. We first define 
the class of square integrable stochastic functions. 


Definition 7.37 (Square integrable (adapted) stochastic function)
Let $f : [0, \infty) \times \Omega \to \mathbb{R}$ be such that
$f(a, \cdot)$ is measurable for each $a \in [0, \infty)$. If
$E(\int_0^\infty f(a)^2 \, da) < \infty$, then $f$ is a square integrable
stochastic function. If in addition $f(a, \cdot)$ is
measurable-$\mathcal{F}_a$ for each $a \in [0, \infty)$ then $f$ is a square
integrable adapted stochastic function.


For square integrable stochastic functions, we have
$E(\int_0^\infty f(a)^2 \, da) = \int_0^\infty E(f(a)^2) \, da$. (Why?)


Definition 7.38 (Approximatable stochastic function) Let $f$ be a
square integrable adapted stochastic function such that there is a sequence
$\{f_n\}$ of random step functions such that

$$\lim_{n \to \infty} E\left( \int_0^\infty |f(a) - f_n(a)|^2 \, da \right) = 0.$$

Then $f$ is an approximatable stochastic function and $\{f_n\}$ is a
sequence of approximating step functions for $f$.


A general definition of the Ito stochastic integral can now be given. 


Definition 7.39 (Ito stochastic integral) Suppose $f$ is an
approximatable stochastic function. If for any sequence $\{f_n\}$ of
approximating step functions for $f$ there exists a random variable $I(f)$
with $E(I(f)^2) < \infty$ and such that
$\lim_{n \to \infty} E(|I(f) - I(f_n)|^2) = 0$, then we call $I(f)$ the Ito
stochastic integral and write

$$I(f) = \int_0^\infty f(a) \, d\mathcal{W}(a) = \int_0^\infty f \, d\mathcal{W}.$$


Proposition 7.40 For every approximatable stochastic function the Ito
stochastic integral exists, is unique (to within equality, a.s.), and
satisfies

$$E(I(f)^2) = E\left( \int_0^\infty |f(a)|^2 \, da \right).$$


Proof. See Brzeźniak and Zastawniak (1999, pp. 184-185). ■

In the light of this result, if f is an approximatable stochastic function, 
we shall also call f a stochastically integrable function. 

If $1_{[b,c)} f$ is stochastically integrable, we can define

$$\begin{aligned}
\int_b^c f(a) \, d\mathcal{W}(a)
  &= \int_0^\infty 1_{[b,c)}(a) f(a) \, d\mathcal{W}(a) \\
  &= I(1_{[b,c)} f),
\end{aligned}$$

where $1_{[b,c)}(a) = 1$ if $b \le a < c$ and is zero otherwise. We also
write $I(1_{[b,c)} f) = \int_b^c f \, d\mathcal{W}$.

It is not always easy to check that a function $f$ is stochastically
integrable. The next result establishes that $f$ is stochastically
integrable if it is continuous almost surely and properly adapted.


Proposition 7.41 Let $f$ be a stochastic function with continuous sample
paths almost surely such that $f(a, \cdot)$ is measurable-$\mathcal{F}_a$
for all $a \in [0, \infty)$. Then


(i) if $f$ is a square integrable stochastic function, then $f$ is
stochastically integrable;

(ii) if $1_{[b,c)} f$ is a square integrable stochastic function, then
$1_{[b,c)} f$ is stochastically integrable.


Proof. See Brzeźniak and Zastawniak (1999, pp. 187-188). ■


We can now make sense of the integral $\int_0^1 W_1 \, dW_2$ that led to the
foregoing discussion. Put $W = (W_1, W_2)'$, $f = W_1$ and
$\mathcal{W} = W_2$, and apply Proposition 7.41. By definition, $f = W_1$ is
a stochastic function with continuous sample paths, with $W_1(a)$
measurable-$\mathcal{F}_a$, $a \in [0, \infty)$. To verify that
$1_{[0,1)} f = 1_{[0,1)} W_1$ is square integrable, as required for
Proposition 7.41 (ii), we write

$$E\left[ \int_0^\infty 1_{[0,1)}(a) W_1(a)^2 \, da \right]
  = E \int_0^1 W_1(a)^2 \, da
  = \int_0^1 E(W_1(a)^2) \, da
  = \int_0^1 a \, da
  = \left[ a^2/2 \right]_0^1 = 1/2 < \infty.$$

$f = W_1$ is therefore stochastically integrable on $[0, 1]$, ensuring that
$\int_0^1 W_1 \, dW_2$ is well defined.
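
As an informal numerical companion to this calculation, the sketch below approximates $E \int_0^1 W_1(a)^2 \, da$ and $E[(\int_0^1 W_1 \, dW_2)^2]$ by Monte Carlo on a grid of mesh $1/n$; by Proposition 7.40 both should be close to $1/2$. The grid size, number of replications, and function name are assumptions made for illustration, and the left-endpoint sums are only a discrete stand-in for the Ito integral.

```python
import numpy as np

def simulate_moments(n=250, reps=4000, seed=0):
    """Monte Carlo estimates of E[int_0^1 W1^2 da] and E[(int_0^1 W1 dW2)^2],
    using n grid steps and left-endpoint (non-anticipating) sums."""
    rng = np.random.default_rng(seed)
    dw1 = rng.standard_normal((reps, n)) / np.sqrt(n)        # increments of W1
    dw2 = rng.standard_normal((reps, n)) / np.sqrt(n)        # increments of W2
    w1 = np.cumsum(dw1, axis=1)
    w1_left = np.hstack([np.zeros((reps, 1)), w1[:, :-1]])   # W1 at left endpoints
    int_w1_sq = np.sum(w1_left ** 2, axis=1) / n             # approximates int W1^2 da
    ito = np.sum(w1_left * dw2, axis=1)                      # approximates int W1 dW2
    return float(np.mean(int_w1_sq)), float(np.mean(ito ** 2))
```

Both returned averages should lie near 0.5, up to simulation noise and a small discretization error of order $1/n$.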

We can extend the notion of Ito stochastic integral to vector- or
matrix-valued integrals; these forms appear commonly in applications.
Specifically, let $f$ be $m \times 1$, let $F$ be $m \times k$ and let $W$
be $k \times 1$. Then

$$\int_b^c f \, dW'$$

is an $m \times k$ matrix with elements

$$\int_b^c f_i \, dW_j, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, k,$$

and

$$\int_b^c F \, dW$$

is an $m \times 1$ vector with elements

$$\sum_{j=1}^{k} \int_b^c F_{ij} \, dW_j, \qquad i = 1, \ldots, m.$$

For example, the matrix

$$\int_b^c W \, dW'$$

is the $k \times k$ matrix of random variables with elements
$\int_b^c W_i \, dW_j$, $i, j = 1, \ldots, k$.

We now turn our attention to the limiting behavior of
$\int_0^1 W_{1n} \, dW_{2n}$. For this, consider a more general setup in
which we have $(U_n, V_n) \Rightarrow (U, V)$, and consider the limiting
behavior of

$$\int_0^1 U_n \, dV_n'.$$

Contrary to what one might at first expect it is not true in general that
$\int_0^1 U_n \, dV_n' \Rightarrow \int_0^1 U \, dV'$.

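To see this, recall from Theorem 7.21 (b) that
$n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t \Rightarrow (\sigma^2/2)(W(1)^2 - \tau^2/\sigma^2)$,
where $\tau^2 = \mathrm{plim} \, n^{-1} \sum_{t=1}^{n} \varepsilon_t^2$.
Writing $W_n(a_{t-1}) = n^{-1/2} \sum_{s=1}^{t-1} \varepsilon_s / \sigma$, we
have

$$\begin{aligned}
n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t
  &= \sigma n^{-1/2} \sum_{t=1}^{n} W_n(a_{t-1}) \varepsilon_t \\
  &= \sigma^2 \sum_{t=1}^{n} W_n(a_{t-1}) \big( W_n(a_t) - W_n(a_{t-1}) \big) \\
  &= \sigma^2 \int_0^1 W_n \, dW_n,
\end{aligned}$$

where the first equality is taken from the proof of Theorem 7.21 (b), the
second uses the fact that
$\varepsilon_t = \sigma n^{1/2} (W_n(a_t) - W_n(a_{t-1}))$, and the third
uses the definition of a stochastic integral.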

Now if $(U_n, V_n) \Rightarrow (U, V)$ were to imply

$$\int_0^1 U_n \, dV_n' \Rightarrow \int_0^1 U \, dV',$$

then $W_n \Rightarrow W$ (as assumed in conditions 7.21 (ii)) would imply
that

$$\int_0^1 W_n \, dW_n \Rightarrow \int_0^1 W \, dW.$$

(Set $U_n = V_n = W_n$ and $U = V = W$.) It is a standard exercise in
stochastic integration to show that

$$\int_0^1 W \, dW = (1/2)(W(1)^2 - 1),$$

so we would then have

$$n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t \Rightarrow (\sigma^2/2)(W(1)^2 - 1).$$

But we have already seen that

$$\begin{aligned}
n^{-1} \sum_{t=1}^{n} Y_{t-1} \varepsilon_t
  &= \sigma^2 \int_0^1 W_n \, dW_n \\
  &\Rightarrow (\sigma^2/2)(W(1)^2 - \tau^2/\sigma^2).
\end{aligned}$$

We now see that the convergence we might have naively expected does not
hold in general, because we can have $\tau^2 \ne \sigma^2$. Clearly, more is
going on than might appear at first glance.
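
The closed form $\int_0^1 W \, dW = (1/2)(W(1)^2 - 1)$ has an exact discrete counterpart: for any path on a grid, $\sum_t W(a_{t-1})(W(a_t) - W(a_{t-1})) = \frac{1}{2}\big(W(1)^2 - \sum_t (W(a_t) - W(a_{t-1}))^2\big)$, and the sum of squared increments is close to 1 for a fine grid. The sketch below checks both facts for simulated Gaussian increments; the function name and grid size are illustrative choices.

```python
import numpy as np

def ito_sum_and_closed_form(n=1000, seed=0):
    """Compare the left-endpoint sum for int W dW with (W(1)^2 - sum dW_t^2) / 2."""
    rng = np.random.default_rng(seed)
    dw = rng.standard_normal(n) / np.sqrt(n)        # Wiener increments, mesh 1/n
    w = np.concatenate([[0.0], np.cumsum(dw)])      # W(0), W(1/n), ..., W(1)
    ito_sum = float(np.sum(w[:-1] * dw))            # sum W(a_{t-1}) * dW_t
    quad_var = float(np.sum(dw ** 2))               # realized quadratic variation
    closed_form = float(0.5 * (w[-1] ** 2 - quad_var))
    return ito_sum, closed_form, quad_var
```

The first two returned values coincide up to rounding, and the third is near 1. When the increments are serially correlated, the normalized sum of squared increments instead settles near $\tau^2/\sigma^2$, which is one way to see where the discrepancy discussed above comes from.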
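To see what this is, let $\{\eta_t\}$ and $\{\varepsilon_t\}$ each obey the
FCLT, and write

$$U_n(a_t) = n^{-1/2} \sum_{s=1}^{t} \eta_s / \sigma_1,$$

$$V_n(a_t) = n^{-1/2} \sum_{s=1}^{t} \varepsilon_s / \sigma_2,$$

with $a_t = t/n$ and the definitions
$\sigma_1^2 = \lim_{n \to \infty} \mathrm{var}(n^{-1/2} \sum_{t=1}^{n} \eta_t)$
and
$\sigma_2^2 = \lim_{n \to \infty} \mathrm{var}(n^{-1/2} \sum_{t=1}^{n} \varepsilon_t)$.
Then

$$\begin{aligned}
\int_0^1 U_n \, dV_n
  &= \sum_{t=1}^{n} U_n(a_{t-1}) \big( V_n(a_t) - V_n(a_{t-1}) \big) \\
  &= n^{-1/2} \sum_{t=1}^{n} U_n(a_{t-1}) \varepsilon_t / \sigma_2 \\
  &= n^{-1} \sum_{t=2}^{n} \sum_{s=1}^{t-1} \eta_s \varepsilon_t / (\sigma_1 \sigma_2).
\end{aligned}$$

Taking expectations, we have, say,

$$\begin{aligned}
E\left( \int_0^1 U_n \, dV_n \right)
  &= E\left( n^{-1} \sum_{t=2}^{n} \sum_{s=1}^{t-1} \eta_s \varepsilon_t / (\sigma_1 \sigma_2) \right) \\
  &= n^{-1} \sum_{t=2}^{n} \sum_{s=1}^{t-1} E(\eta_s \varepsilon_t) / (\sigma_1 \sigma_2) \\
  &\equiv \Lambda_n.
\end{aligned}$$

The value of $\Lambda_n$ depends on the covariances
$E(\eta_s \varepsilon_t)$, $s = 1, \ldots, t - 1$, $t = 1, \ldots, n$. If
these are all zero, as happens, for example, when $\eta_t = \varepsilon_t$
and $\{\varepsilon_t\}$ is either independent or a martingale difference
sequence, then $\Lambda_n$ is zero. If, however, $\varepsilon_t$ is
correlated with past values $\eta_s$, $s = 1, \ldots, t - 1$, then
$\Lambda_n$ need not be zero.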
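
A simple case in which $\Lambda_n$ is not zero is $\eta_t = \varepsilon_t = u_t + \theta u_{t-1}$ with $\{u_t\}$ i.i.d. with unit variance. This MA(1) design is our own illustration rather than one taken from the text. Here $\sigma_1^2 = \sigma_2^2 = (1 + \theta)^2$ and $E(\varepsilon_{t-1} \varepsilon_t) = \theta$, so that $\Lambda_n = ((n-1)/n) \, \theta/(1 + \theta)^2$. The sketch below compares this value with the Monte Carlo mean of $\int_0^1 U_n \, dV_n$.

```python
import numpy as np

def mean_discrete_integral_ma1(theta=0.5, n=300, reps=4000, seed=0):
    """Monte Carlo mean of int U_n dV_n when eta_t = eps_t = u_t + theta * u_{t-1},
    with u_t i.i.d. N(0, 1), together with the implied value of Lambda_n."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((reps, n + 1))
    eps = u[:, 1:] + theta * u[:, :-1]                # MA(1) innovations, t = 1..n
    sigma = abs(1.0 + theta)                          # sqrt of the long-run variance
    partial = np.cumsum(eps, axis=1)
    lagged = np.hstack([np.zeros((reps, 1)), partial[:, :-1]])    # sum_{s<t} eps_s
    integral = np.sum(lagged * eps, axis=1) / (n * sigma ** 2)    # int U_n dV_n
    lambda_n = (n - 1) / n * theta / (1.0 + theta) ** 2
    return float(np.mean(integral)), float(lambda_n)
```

With $\theta = 0.5$ the implied limit is $\theta/(1+\theta)^2 \approx 0.222$, which also equals $(1/2)(1 - \tau^2/\sigma^2)$ for this design, in agreement with the unit root result recalled above.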
By assumption, we have that $U_n \Rightarrow U$ and $V_n \Rightarrow V$,
where $U$ and $V$ are both standard Wiener processes (not necessarily
independent of each other). Now the expectation of $\int_0^1 U \, dV$
depends not at all on the covariance properties of $\{\varepsilon_t\}$ and
$\{\eta_t\}$. In fact, it can be shown that $E(\int_0^1 U \, dV) = 0$.
Thus, at a minimum, we must recenter $\int_0^1 U_n \, dV_n$ to

$$\int_0^1 U_n \, dV_n - \Lambda_n,$$

to ensure that its expectation matches that of $\int_0^1 U \, dV$.

In fact, for the stochastic processes we study, this recentering is enough
to give us what we want. Conditions ensuring that $\{\varepsilon_t\}$ and
$\{\eta_t\}$ obey the FCLT together with the assumption that
$\Lambda_n \to \Lambda$ are sufficient to deliver

$$\int_0^1 U_n \, dV_n - \Lambda_n \Rightarrow \int_0^1 U \, dV,$$

a consequence of Theorem 4.1 of DeJong and Davidson (2000). DeJong and
Davidson’s proof is sophisticated and involved. We therefore content
ourselves with stating corollaries to DeJong and Davidson’s (2000) Theorem
4.1 that deliver the desired results for our cointegration setup.

For this, we let $\{\eta_t\}$ and $\{\varepsilon_t\}$ be $m \times 1$ and
$k \times 1$ vectors, respectively, and define

$$U_n(a) = \Sigma_1^{-1/2} n^{-1/2} \sum_{t=1}^{[an]} \eta_t,$$

$$V_n(a) = \Sigma_2^{-1/2} n^{-1/2} \sum_{t=1}^{[an]} \varepsilon_t,$$

where
$\Sigma_1 = \lim_{n \to \infty} \mathrm{var}(n^{-1/2} \sum_{t=1}^{n} \eta_t)$
and
$\Sigma_2 = \lim_{n \to \infty} \mathrm{var}(n^{-1/2} \sum_{t=1}^{n} \varepsilon_t)$.
We also define

$$\Lambda_n \equiv \Sigma_1^{-1/2} \, n^{-1} \sum_{t=2}^{n} \sum_{s=1}^{t-1} E(\eta_s \varepsilon_t') \, \Sigma_2^{-1/2\prime}.$$


Theorem 7.42 (i.i.d. stochastic integral convergence) Suppose that
$\{\eta_t\}$ and $\{\varepsilon_t\}$ are i.i.d. sequences each satisfying
the conditions of the multivariate Donsker theorem. Suppose also that
$\Lambda_n \to \Lambda$, a finite matrix. Then

$$\left( U_n, V_n, \int_0^1 U_n \, dV_n' - \Lambda_n \right) \Rightarrow \left( U, V, \int_0^1 U \, dV' \right).$$


Proof. The argument of the proof of DeJong and Davidson (2000, Theorem
4.1) specialized to the i.i.d. case delivers the result. ■

Note that we have stated the conclusion in terms of the joint convergence
of $U_n$, $V_n$ and $\int_0^1 U_n \, dV_n' - \Lambda_n$. This is convenient
because it permits easy application of the continuous mapping theorem, as
DeJong and Davidson (2000) note.


Exercise 7.43 (i) Verify that $\Lambda = 0$ when $\{\eta_t\}$ and
$\{\varepsilon_t\}$ are independent and the conditions of Theorem 7.42 hold,
so that $\int_0^1 U_n \, dV_n' \Rightarrow \int_0^1 U \, dV'$.
(ii) Find the value of $\Lambda$ when $\eta_t = \varepsilon_{t-1}$.


We now have what we need to establish the limiting distribution of the 
least squares estimator for cointegrated random walks. 


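Exercise 7.44 (Cointegrated regression with random walk) Suppose the
following two conditions hold.

(i) $\{\varepsilon_t\}$ and $\{\eta_t\}$ are independent i.i.d. processes
with $E(\varepsilon_t) = E(\eta_t) = 0$,
$0 < \sigma_1^2 = \mathrm{var}(\eta_t) < \infty$ and
$0 < \sigma_2^2 = \mathrm{var}(\varepsilon_t) < \infty$.


(ii) $Y_t = X_t \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, where
$X_t = X_{t-1} + \eta_t$, $t = 1, 2, \ldots$, and $X_0 = 0$, with
$\beta_o \in \mathbb{R}$.


Then for independent standard Wiener processes $W_1$, $W_2$,

(a) $n^{-2} \sum_{t=1}^{n} X_t^2 \Rightarrow \sigma_1^2 \int_0^1 W_1(a)^2 \, da$;

(b) $n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t \Rightarrow \sigma_1 \sigma_2 \int_0^1 W_1(a) \, dW_2(a)$;

(c) $n(\hat\beta_n - \beta_o) \Rightarrow (\sigma_2/\sigma_1) \left[ \int_0^1 W_1(a)^2 \, da \right]^{-1} \int_0^1 W_1(a) \, dW_2(a)$;

(d) $\hat\beta_n \xrightarrow{p} \beta_o$.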
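
Conclusion (c) says that the estimation error shrinks at rate $n$ rather than the usual $n^{1/2}$. The following simulation sketch illustrates this without substituting for the proof requested in the exercise. It assumes standard normal shocks, $\beta_o = 1$, and a hypothetical function name.

```python
import numpy as np

def median_abs_error(n, beta_o=1.0, reps=1000, seed=0):
    """Median of |beta_hat - beta_o| when Y_t = X_t * beta_o + eps_t and X_t is a
    random walk whose shocks are independent of eps_t."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.standard_normal((reps, n)), axis=1)
    y = beta_o * x + rng.standard_normal((reps, n))
    beta_hat = np.sum(x * y, axis=1) / np.sum(x * x, axis=1)
    return float(np.median(np.abs(beta_hat - beta_o)))
```

Comparing `median_abs_error(100)` with `median_abs_error(1600)`, a sixteenfold increase in $n$ should reduce the typical error by a factor near 16, whereas a $\sqrt{n}$-consistent estimator would improve only by a factor near 4.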


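Now suppose that $X_t$ is a $k \times 1$ vector of integrated processes,
$X_t = X_{t-1} + \eta_t$, and that $Y_t = X_t' \beta_o + \varepsilon_t$.
Further let $\zeta_t = (\eta_t', \varepsilon_t)'$ obey the multivariate
FCLT. $\{X_t\}$ and $\{Y_t\}$ will then be integrated though not
necessarily random walk processes. To handle this case, let

$$\Sigma \equiv \lim_{n \to \infty} \mathrm{var}\Big( n^{-1/2} \sum_{t=1}^{n} \zeta_t \Big),$$

and define the $(k + 1) \times 1$ vector

$$\xi_t \equiv \Sigma^{-1/2} \zeta_t,$$

so that $\zeta_t = \Sigma^{1/2} \xi_t$. Using this, we can write

$$\begin{aligned}
X_t &= X_{t-1} + \eta_t \\
    &= X_{t-1} + D_1' \Sigma^{1/2} \xi_t,
\end{aligned}$$

where $D_1'$ is the $k \times (k + 1)$ selection matrix such that
$D_1' \zeta_t = \eta_t$. Similarly, we can write

$$\begin{aligned}
Y_t &= X_t' \beta_o + \varepsilon_t \\
    &= X_t' \beta_o + D_2' \Sigma^{1/2} \xi_t,
\end{aligned}$$

where $D_2'$ is the $1 \times (k + 1)$ selection vector such that
$D_2' \zeta_t = \varepsilon_t$. Next, we write

$$W_n(a_t) = n^{-1/2} \sum_{s=1}^{t} \xi_s, \qquad a_t = t/n.$$
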
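With this notation we have

$$n^{-2} \sum_{t=1}^{n} X_t X_t' = n^{-1} \sum_{t=1}^{n} D_1' \Sigma^{1/2} W_n(a_t) W_n(a_t)' \Sigma^{1/2\prime} D_1.$$

Provided that $\{\xi_t\}$ obeys the multivariate FCLT, we have

$$n^{-2} \sum_{t=1}^{n} X_t X_t' \Rightarrow D_1' \Sigma^{1/2} \int_0^1 W(a) W(a)' \, da \; \Sigma^{1/2\prime} D_1.$$

We also have

$$\begin{aligned}
n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t
  &= n^{-1} \sum_{t=1}^{n} X_{t-1} \varepsilon_t + n^{-1} \sum_{t=1}^{n} \eta_t \varepsilon_t \\
  &= \sum_{t=1}^{n} D_1' \Sigma^{1/2} W_n(a_{t-1}) \big( W_n(a_t) - W_n(a_{t-1}) \big)' \Sigma^{1/2\prime} D_2
     + n^{-1} \sum_{t=1}^{n} \eta_t \varepsilon_t \\
  &= D_1' \Sigma^{1/2} \int_0^1 W_n \, dW_n' \; \Sigma^{1/2\prime} D_2
     + n^{-1} \sum_{t=1}^{n} \eta_t \varepsilon_t.
\end{aligned}$$

The second term, $n^{-1} \sum_{t=1}^{n} \eta_t \varepsilon_t$, can be handled
with a law of large numbers. To handle the first term, we apply the
following corollary of Theorem 4.1 of DeJong and Davidson (2000).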


Theorem 7.45 (Heterogeneous mixing stochastic integral convergence)
Suppose that $\{\eta_t\}$ and $\{\varepsilon_t\}$ are vector-valued mixing
sequences each satisfying the conditions of the heterogeneous mixing
multivariate FCLT (Theorem 7.30). Suppose also that
$\Lambda_n \to \Lambda$, a finite matrix. Then

$$\left( U_n, V_n, \int_0^1 U_n \, dV_n' - \Lambda_n \right) \Rightarrow \left( U, V, \int_0^1 U \, dV' \right).$$


Proof. This is an immediate corollary of Theorem 4.1 of DeJong and
Davidson (2000). ■


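Theorem 7.46 (Cointegrating regression with mixing innovations)
Suppose


(i) $\{\zeta_t = (\eta_t', \varepsilon_t)'\}$ is a globally covariance
stationary mixing process of $(k + 1) \times 1$ vectors with $\phi$ of size
$-r/(2r - 2)$ or $\alpha$ of size $-r/(r - 2)$, $r > 2$, such that
$E(\zeta_t) = 0$, $E(|\zeta_{ti}|^r) < \Delta < \infty$,
$i = 1, \ldots, k + 1$ and for all $t$. Suppose further that
$\Sigma = \lim_{n \to \infty} \mathrm{var}(n^{-1/2} \sum_{t=1}^{n} \zeta_t)$
is finite and nonsingular.


(ii) $Y_t = X_t' \beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, where
$X_t = X_{t-1} + \eta_t$, $t = 1, 2, \ldots$, and $X_0 = 0$, with
$\beta_o \in \mathbb{R}^k$.


Let $W$ denote a standard multivariate Wiener process, let $D_1'$ be the
$k \times (k + 1)$ selection matrix such that $D_1' \zeta_t = \eta_t$, let
$D_2'$ be the $1 \times (k + 1)$ selection vector such that
$D_2' \zeta_t = \varepsilon_t$, let

$$\Lambda \equiv \lim_{n \to \infty} \Sigma^{-1/2} \, n^{-1} \sum_{t=2}^{n} \sum_{s=1}^{t-1} E(\zeta_s \zeta_t') \, \Sigma^{-1/2\prime},$$

and let
$\Gamma \equiv \lim_{n \to \infty} n^{-1} \sum_{t=1}^{n} E(\eta_t \varepsilon_t)$.
Then $\Lambda$ and $\Gamma$ are finite and


(a) $n^{-2} \sum_{t=1}^{n} X_t X_t' \Rightarrow D_1' \Sigma^{1/2} \left[ \int_0^1 W(a) W(a)' \, da \right] \Sigma^{1/2\prime} D_1$.

(b) $n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t \Rightarrow D_1' \Sigma^{1/2} \left[ \int_0^1 W \, dW' + \Lambda \right] \Sigma^{1/2\prime} D_2 + \Gamma$.

(c)

$$\begin{aligned}
n(\hat\beta_n - \beta_o) \Rightarrow{}
  & \left[ D_1' \Sigma^{1/2} \left[ \int_0^1 W(a) W(a)' \, da \right] \Sigma^{1/2\prime} D_1 \right]^{-1} \\
  & \times \left[ D_1' \Sigma^{1/2} \left[ \int_0^1 W \, dW' + \Lambda \right] \Sigma^{1/2\prime} D_2 + \Gamma \right].
\end{aligned}$$

(d) $\hat\beta_n \xrightarrow{p} \beta_o$.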


Proof. The mixing and moment conditions ensure the finiteness of $\Lambda$
and $\Gamma$. (a) Under the mixing conditions of (i)
$\{\xi_t = \Sigma^{-1/2} \zeta_t\}$ obeys the heterogeneous mixing
multivariate FCLT, so the result follows from the argument preceding the
statement of the theorem.

(b) As we wrote above,

$$n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t = D_1' \Sigma^{1/2} \int_0^1 W_n \, dW_n' \; \Sigma^{1/2\prime} D_2 + n^{-1} \sum_{t=1}^{n} \eta_t \varepsilon_t.$$

Under the mixing conditions of (i) $\{\eta_t \varepsilon_t\}$ obeys the law
of large numbers, so
$n^{-1} \sum_{t=1}^{n} \eta_t \varepsilon_t \xrightarrow{p} \Gamma$. Theorem
7.45 applies given (i) to ensure
$\int_0^1 W_n \, dW_n' \Rightarrow \int_0^1 W \, dW' + \Lambda$. The result
follows by Lemma 4.27.

(c) The results of (a) and (b), together with Lemma 4.27 deliver the
conclusion, as
$n(\hat\beta_n - \beta_o) = (n^{-2} \sum_{t=1}^{n} X_t X_t')^{-1} n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t$.

(d) $\hat\beta_n - \beta_o = n^{-1} (n(\hat\beta_n - \beta_o)) = o(1) O_p(1) = o_p(1)$. ■

Observe that the conclusions here extend those of Exercise 7.44 in a natural
way.

The statement of the results can be somewhat simplified by making
use of the notion of a Brownian motion with covariance $\Sigma$, defined as
$B = \Sigma^{1/2} W$, denoted $BM(\Sigma)$, where $W$ is a standard Wiener
process (Brownian motion). With this notation we have

$$\begin{aligned}
n(\hat\beta_n - \beta_o) \Rightarrow{}
  & \left[ D_1' \int_0^1 B(a) B(a)' \, da \; D_1 \right]^{-1} \\
  & \times \left[ D_1' \int_0^1 B \, dB' \; D_2 + D_1' \Sigma^{1/2} \Lambda \Sigma^{1/2\prime} D_2 + \Gamma \right].
\end{aligned}$$

In conclusion (c), we observe the presence of the term

$$\left[ D_1' \int_0^1 B(a) B(a)' \, da \; D_1 \right]^{-1} \left( D_1' \Sigma^{1/2} \Lambda \Sigma^{1/2\prime} D_2 + \Gamma \right).$$

The component involving $\Lambda$ arises from serial correlation in
$\zeta_t$; the component involving $\Gamma$ arises from correlation between
$\eta_t$ and $\varepsilon_t$. As in the case of a unit root regression, the
effect of serial correlation in the errors is not to induce inconsistency
of $\hat\beta_n$: instead, the asymptotic distribution exhibits a bias.
Similarly, the correlation between the regressors $X_t$ and the errors
$\varepsilon_t$ also does not result in inconsistency of $\hat\beta_n$, as
it does in the results of previous chapters. Instead, the effect is again
the more modest effect of an asymptotic bias in $\hat\beta_n$. As in the
case of regression with a unit root, we note that even when $\Lambda$ and
$\Gamma$ vanish, the asymptotic distribution still is not centered around
zero, due to the presence of correlation between
$\int_0^1 B(a) B(a)' \, da$ and $\int_0^1 B \, dB'$. Asymptotic bias thus
remains, although inconsistency does not arise.
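
The role of $\Gamma$ can be illustrated in the scalar case with a simulation in which $\varepsilon_t$ is contemporaneously correlated with $\eta_t$ but both are i.i.d. over time, so that $\Lambda = 0$ and $\Gamma = E(\eta_t \varepsilon_t) \ne 0$. The design below (unit variances, correlation 0.8, $\beta_o = 1$) and the function name are assumptions made only for this sketch.

```python
import numpy as np

def scaled_errors_with_endogeneity(n=500, rho=0.8, beta_o=1.0, reps=2000, seed=0):
    """Draws of n * (beta_hat - beta_o) when eps_t has correlation rho with the
    contemporaneous regressor shock eta_t; both are i.i.d. over time."""
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal((reps, n))
    eps = rho * eta + np.sqrt(1.0 - rho ** 2) * rng.standard_normal((reps, n))
    x = np.cumsum(eta, axis=1)                      # integrated regressor
    y = beta_o * x + eps
    beta_hat = np.sum(x * y, axis=1) / np.sum(x * x, axis=1)
    return n * (beta_hat - beta_o)
```

In such draws the scaled error $n(\hat\beta_n - \beta_o)$ is shifted well away from zero, while $\hat\beta_n - \beta_o$ itself remains of order $1/n$: the correlation shows up as asymptotic bias rather than inconsistency.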

The least squares estimator $\hat\beta_n$ is only the simplest of a
fascinating array of estimators proposed for cointegrating regressions, and
the results
depend heavily on the placement of the unit roots and the nature of the 
cointegrating relationships. We cannot address these issues here; rather, 
our intent is that the material presented will provide basic understand- 
ing and useful tools for the reader interested in delving more deeply into 
that literature. For further reading, the reader is directed to the works of 
Hansen (1992a,b), Johansen (1988, 1991, 1996), Park and Phillips (1988, 
1989), Phillips (1987), Saikkonen (1992), Sims, Stock, and Watson (1990),
and Wooldridge (1994). 
CHAPTER 8


Directions for Further Study 


Although the results of the previous chapters cover a considerable range 
of the possibilities of interest to economists, they also constitute but a 
modest entry into the rich realm of modern econometrics. We may view 
the territory before us as extending in a number of different though re- 
lated directions. Among these are more general data generating processes 
(DGPs); more general specifications for the models used to analyze these 
DGPs; estimation procedures alternative to least squares and instrumental 
variables; and consideration of the consequences and detection of model 


misspecification. 


8.1 Extending the Data Generating Process 


Loosely speaking, there are three dimensions in which the data generating 
processes we consider can be extended: moments, memory, and hetero- 
geneity. For example, although we consider unit root DGPs, we have not 
discussed DGPs that involve deterministic time trends. Nevertheless, the 
Markov law of large numbers (Theorem 3.7) or the McLeish law of large 
numbers (Theorem 3.47) can be useful in establishing consistency in mod- 
els with trending explanatory or instrumental variables. In fact, consistency 
may happen “faster” in models with these variables because the error vari- 
ance may become quite negligible in comparison to the magnitude of the 
regression function $X_t' \beta_o$. Asymptotic normality can be established
with the help of the Lindeberg or Martingale-Lindeberg central limit theorems
(Theorems 5.6 or 5.24). In fact, conditions ensuring asymptotic normal- 
ity in models with nonstochastic and possibly trending variables were the 
subject of careful attention very early on (Grenander, 1954) and there is a 
well-developed general theory now available (e.g., Crowder, 1980). 

Deterministic time trends as usually considered constitute a growth in 
the first moment of the dependent variable. Trends may occur in higher 
moments as well. The key to handling such cases is to have available laws 
of large numbers, central limit theorems and/or functional central limit the- 
orems that permit such possibilities. For example, Wooldridge and White 
(1988) give FCLT results for processes with possibly trending moments of 
higher order. 

We have considered DGPs satisfying the memory requirements of er- 
godicity and mixing. Recall also that we imposed the mixingale memory 
requirement to establish the stationary ergodic CLT and FCLT. Although 
it constitutes a restriction for stationary ergodic processes, the mixingale 
notion can also be used to relax the memory conditions in the mixing con- 
text. Specifically, by considering stochastic processes that depend on an 
infinite history of an underlying mixing process, but which depend primar- 
ily on the “near epoch” of the mixing process, we can obtain processes that 
have longer memory than a simple mixing process but which inherit enough 
of the properties of the underlying mixing process to satisfy the mixingale 
condition. In fact, it can be shown that such “near epoch dependent” func- 
tions of a mixing process are mixingales that are sufficiently well behaved 
so as to satisfy laws of large numbers, CLTs and FCLTs. 

Such conditions were introduced by Ibragimov (1962) and treated by 
Billingsley (1968). McLeish (1975a,b) used these conditions in developing 
laws of large numbers, CLTs and FCLTs, which were introduced to econo- 
metrics by Gallant and White (1988). See Gallant and White (1988) and 
Davidson (1994) for further discussion and details. 

In Chapter 7, we considered integrated (unit root) processes, which ex- 
hibit considerably more dependence than mixing, mixingales, or ergodicity 
permit. We see, however, that the large sample behavior of our estima- 
tors in these cases depends on the FCLT, whose validity is based on the 
memory properties of the innovations to the integrated process. A memory 
requirement intermediate to the mixingale and integrated processes consid- 
ered here is that of fractional integration, a generalization of the notion of 
integrated processes. A review of such processes is given by Sowell (1992) 
and Baillie (1996). The tools required to study the large sample properties 
of estimators associated with fractionally integrated DGPs are extensions 
of the FCLT and the other tools of Chapter 7. With fractional integration, 
the FCLT analogs deliver convergence to “fractional Brownian motion.”
See, e.g., Taqqu (1975) for further discussion. 

To keep our discussion of the FCLT in Chapter 7 relatively simple, we 
impose the global covariance stationarity assumption, restricting the het- 
erogeneity of the DGP more than is necessary to deliver the FCLT or 
analogous results. See Wooldridge and White (1988), Davidson (1994), and 
DeJong and Davidson (2000) for results that do not require global covari- 
ance stationarity. 


8.2 Nonlinear Models 


Throughout, we have restricted attention to models linear in the parame- 
ters, although we have allowed nonlinear restrictions among the parameters 
to hold. A more general model that contains many situations of interest to 
economists can be written as $q_t(X_t, Y_t, \beta)$, where for some $\beta_o$ it is assumed 
that $q_t(X_t, Y_t, \beta_o) = \varepsilon_t$. 

In the particular case we studied, the model has the form 


$$q_t(X_t, Y_t, \beta) = Y_t - X_t'\beta.$$


Situations in which the elements of the dependent variable $Y_t$ take values 
in a restricted subset of the real line are often of considerable interest. For 
example, suppose $Y_t$ is a scalar that can take only the values 0 or 1 (the 
“limited dependent variable” case). In this case a relevant model is given 
by 

$$q_t(X_t, Y_t, \beta) = Y_t - F(X_t'\beta),$$

where $F$ is a cumulative distribution function (e.g., the normal or logistic 
c.d.f.). We interpret $F(X_t'\beta)$ as a model for $E(Y_t \mid X_t) = P[Y_t = 1 \mid X_t]$. 

There is a vast range of other possibilities of which the linear model is 
merely a simple and convenient special case. Further, the nature of the 
model adopted and the other knowledge available about the DGP can play 
a determining role in the estimation procedures one employs. 
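
As a small illustration of the two models just described, the following sketch evaluates $q_t(X_t, Y_t, \beta)$ for the linear case and for a binary response with a logistic c.d.f. It assumes a scalar $Y_t$; the function names (`q_linear`, `q_binary`, `logistic_cdf`) are hypothetical and chosen only for this example.

```python
import math


def q_linear(x, y, beta):
    """Residual q_t = Y_t - X_t'beta for scalar Y_t (x and beta are sequences)."""
    return y - sum(xi * bi for xi, bi in zip(x, beta))


def logistic_cdf(v):
    """Logistic c.d.f. F(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def q_binary(x, y, beta, cdf=logistic_cdf):
    """Residual q_t = Y_t - F(X_t'beta) for Y_t in {0, 1}.

    F(X_t'beta) plays the role of the model for P[Y_t = 1 | X_t].
    """
    index = sum(xi * bi for xi, bi in zip(x, beta))
    return y - cdf(index)


# Linear model: 5 - (1*1 + 2*1.5) = 1
print(q_linear([1.0, 2.0], 5.0, [1.0, 1.5]))
# Binary model: the index is 0, so F = 0.5 and the residual is 1 - 0.5
print(q_binary([1.0, 0.0], 1, [0.0, 3.0]))
```

Passing a different `cdf` (for example a normal c.d.f.) changes the binary response model without changing the residual function itself.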


8.3 Other Estimation Techniques 


To estimate parameters of the model $q_t(X_t, Y_t, \beta)$ just discussed, we can 
employ the method of moments introduced and described briefly in Chapter 
4. Specifically, if we have instrumental variables $Z_t$ available such that 
$E(Z_t \varepsilon_t) = 0$, then we can attempt to estimate $\beta_o$ by solving the problem 

$$\min_{\beta} \; q(\beta)' Z \hat{P}_n Z' q(\beta),$$

where $q(\beta)$ is the $np \times 1$ vector with $t$th block $q_t(X_t, Y_t, \beta)$, so 

$$Z'q(\beta) = \sum_{t=1}^{n} Z_t q_t(X_t, Y_t, \beta).$$

This corresponds to the method of moments procedure with 
$g(X_t, Y_t, Z_t, \beta) = Z_t q_t(X_t, Y_t, \beta)$. 
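
To make the objective concrete, the sketch below treats the simplest case: a scalar regressor, a scalar instrument, and the linear residual $q_t = Y_t - X_t\beta$. Under these assumptions the sample moment $\sum_t Z_t q_t$ is linear in $\beta$, so the quadratic objective is minimized where the moment is zero. The function names and the toy data are illustrative only.

```python
def sample_moment(y, x, z, b):
    """Z'q(b) = sum_t z_t * (y_t - x_t * b) for scalar x_t, z_t, y_t."""
    return sum(zt * (yt - xt * b) for yt, xt, zt in zip(y, x, z))


def objective(y, x, z, b, p=1.0):
    """Quadratic form q(b)' Z P Z' q(b); a scalar because there is one instrument."""
    m = sample_moment(y, x, z, b)
    return m * p * m


def iv_estimate(y, x, z):
    """Value of b that sets the sample moment to zero (simple IV estimator)."""
    szx = sum(zt * xt for xt, zt in zip(x, z))
    if szx == 0.0:
        raise ValueError("instrument is orthogonal to the regressor in this sample")
    return sum(zt * yt for yt, zt in zip(y, z)) / szx


# Toy data with y = 2 * x exactly, so the estimator should recover 2.
x = [1.0, 2.0, 3.0, 4.0]
z = [1.0, 1.0, 2.0, 2.0]
y = [2.0 * xt for xt in x]
b_hat = iv_estimate(y, x, z)
print(b_hat, objective(y, x, z, b_hat))
```

With a nonlinear $q_t$ no closed form is generally available, and the same objective would be handed to a numerical minimizer instead.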

The properties of this method of moments estimator have been studied 
by Amemiya (1977), Burguete, Gallant, and Souza (1982), and Hansen 
(1982), among others. 

To establish properties for estimators of nonlinear models analogous to 
those obtained here, we need somewhat more powerful tools than those 
given. In particular, repeated use is made of uniform laws of large numbers 
and the mean value theorem for random functions (e.g., see Jennrich, 1969, 
and White, 1994). 

A leading alternative to the method of instrumental variables studied 
here is the method of maximum likelihood. In fact, if one assumes that 
the disturbances $\varepsilon_t$ are independent and identically distributed as multivariate 
normal with unknown covariance matrix, then the optimal IV estimators 
of Chapter 4 can be shown to be asymptotically equivalent to the 
maximum likelihood estimator under general conditions. There is a broad 
range of situations where maximum likelihood and instrumental variables 
are asymptotically equivalent (see Hausman, 1975, and Amemiya, 1977), 
although this equivalence fails for the general case of nonlinear models pre- 
viously mentioned. In that case, maximum likelihood can be shown to be 
more efficient than instrumental variables (Amemiya, 1977). 

Use of the method of maximum likelihood requires an assumption about 
the distribution of the errors, whereas the instrumental variables method 
does not. Thus, the method of instrumental variables is available in sit- 
uations where a knowledge of the error distribution is absent or suspect. 
Nevertheless, maximum likelihood estimation can be conducted as if the er- 
rors have the assumed distribution, whether this assumption is valid or not. 
This procedure is known as quasi-maximum likelihood estimation, a mem- 
ber of the class of M-estimators (Huber, 1967), which contains a wealth 
of useful and interesting estimators. By selecting an M-estimator appro- 
priately, it is possible to obtain estimators that are robust to failure of 
distributional assumptions or to certain plausible kinds of data errors. 
Again, the study of these estimators requires, among other things, use 
of uniform laws of large numbers and mean value theorems for random 
functions. A general treatment of these estimators that also highlights the 
parallels with IV estimators is that of Gallant and White (1988). 
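
As one concrete member of the class of M-estimators, the sketch below computes a Huber-type location estimate by iteratively reweighted averaging. It is only an illustration of robustness to a gross data error, not a procedure taken from the text; the tuning constant `c = 1.345` and the MAD-based scale are conventional choices assumed here.

```python
import statistics


def huber_location(data, c=1.345, tol=1e-10, max_iter=200):
    """Huber M-estimate of location with a fixed, MAD-based scale.

    Observations within c * scale of the current estimate get weight 1;
    observations farther away are down-weighted in proportion to distance.
    """
    mu = statistics.median(data)
    mad = statistics.median([abs(v - mu) for v in data])
    scale = mad / 0.6745 if mad > 0.0 else 1.0
    for _ in range(max_iter):
        weights = []
        for v in data:
            r = abs(v - mu) / scale
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * v for w, v in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


sample = [1.0, 1.2, 0.8, 1.1, 0.9, 50.0]  # last value is a gross error
print(statistics.mean(sample), huber_location(sample))
```

The sample mean is pulled far from the bulk of the data by the single outlier, while the Huber estimate stays close to 1.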


8.4 Model Misspecification 


Throughout this book, we have maintained the assumption that the data 
generating relationship is known to be 


$$Y_t = X_t'\beta_o + \varepsilon_t, \qquad t = 1, 2, \ldots.$$

It would indeed be fortunate if the relationship between $X_t$ and $Y_t$ were 
ever truly “known.” Owing to the complexity of economic phenomena, it is 
perhaps more realistic to suppose that the relationship between $X_t$ and $Y_t$ 
is unknown. In this case, a linear relationship such as that just given can 
be viewed as a convenient approximation but not necessarily as a definitive 
description of the relationship between $X_t$ and $Y_t$. It then becomes 
important to consider questions such as “How is this approximation to be 
interpreted?”; “What are the properties of the estimated parameters of the 
approximation?”; “How can the approximation be improved?”; and “How 
can we tell if our approximation is exact?”. 

For a discussion of these issues that builds on the material in this book 
in a framework encompassing several of the extensions discussed in this 
chapter, the reader is referred to Estimation, Inference, and Specification 
Analysis (White, 1994). 
Solution Set 


Exercise 2.8 

Proof. Let $a_n = A_n b_n$, where $A_n = [A_{nij}]$ and $b_n = (b_{n1}, b_{n2}, \ldots, b_{nk})'$. 
Then $a_{ni} = \sum_{j=1}^{k} A_{nij} b_{nj}$. Since $A_{nij} = o(1)$ and $b_{nj} = O(1)$, 
$A_{nij} b_{nj} = o(1)$ by Proposition 2.7 (iii). By Proposition 2.7 (ii), $a_{ni} = o(1)$ 
because it is the sum of $k$ terms, each of which is $o(1)$. It follows that 
$a_n = A_n b_n = o(1)$. ∎


Exercise 2.13 

Proof. Since $Z'X/n \xrightarrow{a.s.} Q$ and $\hat{P}_n \xrightarrow{a.s.} P$, it follows from 
Proposition 2.11 that $\det(X'Z\hat{P}_n Z'X/n^2) \xrightarrow{a.s.} \det(Q'PQ)$. Since $Q$ has 
full column rank and $P$ is nonsingular by (iii), $\det(Q'PQ) > 0$. It follows that 
$\det(X'Z\hat{P}_n Z'X/n^2) > 0$ for all $n$ sufficiently large almost surely, so that 
$(X'Z\hat{P}_n Z'X/n^2)^{-1}$ exists for all $n$ sufficiently large a.s. Hence 

$$\tilde{\beta}_n = (X'Z\hat{P}_n Z'X/n^2)^{-1} X'Z\hat{P}_n Z'Y/n^2$$

exists for all $n$ sufficiently large a.s. Given (i), 

$$\tilde{\beta}_n = \beta_o + (X'Z\hat{P}_n Z'X/n^2)^{-1} X'Z\hat{P}_n Z'\varepsilon/n^2.$$

It follows from Proposition 2.11 that 

$$\tilde{\beta}_n \xrightarrow{a.s.} \beta_o + (Q'PQ)^{-1} Q'P \cdot 0 = \beta_o,$$

given (ii) and (iii). ∎
Exercise 2.20 

Proof. Since $Q_n = O(1)$ and $P_n = O(1)$, it follows from Proposition 2.16 
that $\det(X'Z\hat{P}_n Z'X/n^2) - \det(Q_n'P_nQ_n) \xrightarrow{a.s.} 0$. Given (iii), it follows 
from Lemma 2.19 that $\{Q_n'P_nQ_n\}$ is uniformly positive definite, so 

$$\det(Q_n'P_nQ_n) > \delta > 0$$

for all $n$ sufficiently large. It follows that $\det(X'Z\hat{P}_n Z'X/n^2) > \delta/2 > 0$ 
for all $n$ sufficiently large almost surely. Hence 

$$\tilde{\beta}_n = (X'Z\hat{P}_n Z'X/n^2)^{-1} X'Z\hat{P}_n Z'Y/n^2$$

exists for all $n$ sufficiently large almost surely. Given (i), 
$\tilde{\beta}_n = \beta_o + (X'Z\hat{P}_n Z'X/n^2)^{-1} X'Z\hat{P}_n Z'\varepsilon/n^2$. Given (ii) and (iii) 
it follows from Proposition 2.16 that 
$\tilde{\beta}_n - (\beta_o + (Q_n'P_nQ_n)^{-1} Q_n'P_n \times 0) \xrightarrow{a.s.} 0$, that is, 

$$\tilde{\beta}_n \xrightarrow{a.s.} \beta_o.$$ ∎


Exercise 2.22 

Proof. 

(i) By definition, there exist sets $F, G \subset \Omega$ with $P(F) = P(G) = 1$, 
where for all $\omega \in F$ it holds that $|a_n(\omega) n^{-\lambda}| < \Delta_a$ for all $n > N_a$, 
and for all $\omega \in G$ it holds that $|b_n(\omega) n^{-\mu}| < \Delta_b$ for all $n > N_b$. Set 
$\Delta = \Delta_a \Delta_b$ and $N = \max(N_a, N_b)$; then for all $\omega \in F \cap G$ it holds that 
$|a_n(\omega) b_n(\omega) n^{-\lambda} n^{-\mu}| < \Delta$ for all $n > N$. Since $P(F \cap G) = 1$ the first 
result follows. To obtain the second result, let $\Delta = \Delta_a + \Delta_b$, and set $N$ 
as before. Then for all $\omega \in F \cap G$ it holds that 
$|(a_n(\omega) + b_n(\omega)) n^{-\kappa}| \le |a_n(\omega) n^{-\kappa}| + |b_n(\omega) n^{-\kappa}| \le |a_n(\omega) n^{-\lambda}| + |b_n(\omega) n^{-\mu}| < \Delta_a + \Delta_b = \Delta$ 
for all $n > N$. 

(ii) By definition, there exist sets $F, G \subset \Omega$ with $P(F) = P(G) = 1$, 
where for all $\omega \in F$ it holds that $a_n(\omega) = o(n^{\lambda})$ and for all $\omega \in G$ it 
holds that $b_n(\omega) = o(n^{\mu})$. By Proposition 2.7, $a_n(\omega) b_n(\omega) = o(n^{\lambda+\mu})$ 
and $a_n(\omega) + b_n(\omega) = o(n^{\kappa})$ on $F \cap G$. Since $P(F \cap G) = 1$ the results 
follow. 

(iii) Define $F$ as in (i) and $G$ as in (ii). The result follows immediately by 
Proposition 2.7 and the fact that $P(F \cap G) = 1$. ∎


Exercise 2.29 

Proof. The proof is identical to that of Exercise 2.13 except that Proposi- 
tion 2.27 is used instead of Proposition 2.11 and convergence in probability 
replaces convergence almost surely. ∎




Exercise 2.32 

Proof. The proof is identical to that of Exercise 2.20 except that Proposi- 
tion 2.30 is used in place of Proposition 2.16 and convergence in probability 
replaces convergence almost surely. ∎


Exercise 2.35 
Proof. 


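(i) Since $a_n = O_p(n^{\lambda})$ and $b_n = O_p(n^{\mu})$, we know that for a given $\varepsilon > 0$ 
there exist $\Delta_{a,\varepsilon}$, $\Delta_{b,\varepsilon}$, $N_{a,\varepsilon}$, and $N_{b,\varepsilon}$ such that 
$P(\omega : |n^{-\lambda} a_n| > \Delta_{a,\varepsilon}) < \varepsilon/2$ for $n > N_{a,\varepsilon}$ and 
$P(\omega : |n^{-\mu} b_n| > \Delta_{b,\varepsilon}) < \varepsilon/2$ for $n > N_{b,\varepsilon}$. Define 
$N_\varepsilon = \max(N_{a,\varepsilon}, N_{b,\varepsilon})$ and $\Delta_\varepsilon = \Delta_{a,\varepsilon} \Delta_{b,\varepsilon}$. Now if 

$$|n^{-\lambda-\mu} a_n(\omega) b_n(\omega)| = |n^{-\lambda} a_n(\omega)|\,|n^{-\mu} b_n(\omega)| > \Delta_\varepsilon = \Delta_{a,\varepsilon} \Delta_{b,\varepsilon},$$

then $|n^{-\lambda} a_n(\omega)| > \Delta_{a,\varepsilon}$ or $|n^{-\mu} b_n(\omega)| > \Delta_{b,\varepsilon}$. So 

$$P(\omega : |n^{-\lambda-\mu} a_n(\omega) b_n(\omega)| > \Delta_\varepsilon) \le P(\omega : |n^{-\lambda} a_n(\omega)| > \Delta_{a,\varepsilon}) + P(\omega : |n^{-\mu} b_n(\omega)| > \Delta_{b,\varepsilon}) < \varepsilon/2 + \varepsilon/2 = \varepsilon$$

for $n > N_\varepsilon$, which proves that $a_n b_n = O_p(n^{\lambda+\mu})$. 

Next, let $\Delta'_\varepsilon = \Delta_{a,\varepsilon} + \Delta_{b,\varepsilon}$. Since $n^{\kappa} \ge n^{\lambda}$ and $n^{\kappa} \ge n^{\mu}$, it follows that 

$$P(|n^{-\kappa}(a_n + b_n)| > \Delta'_\varepsilon) \le P(|n^{-\kappa} a_n| + |n^{-\kappa} b_n| > \Delta'_\varepsilon)$$
$$\le P(|n^{-\lambda} a_n| + |n^{-\mu} b_n| > \Delta'_\varepsilon)$$
$$\le P(|n^{-\lambda} a_n| > \Delta_{a,\varepsilon}) + P(|n^{-\mu} b_n| > \Delta_{b,\varepsilon})$$
$$< \varepsilon/2 + \varepsilon/2 = \varepsilon$$

for $n > N_\varepsilon$, which proves that $a_n + b_n = O_p(n^{\kappa})$. 

(ii) We have $n^{-\lambda} a_n \xrightarrow{p} 0$ and $n^{-\mu} b_n \xrightarrow{p} 0$. It follows from Proposition 2.30 
that $n^{-\lambda} a_n n^{-\mu} b_n = n^{-(\lambda+\mu)} a_n b_n \xrightarrow{p} 0$, so that $a_n b_n = o_p(n^{\lambda+\mu})$. 
Consider $\{a_n + b_n\}$. Since $\{a_n\}$ and $\{b_n\}$ are $o_p(n^{\kappa})$, we again apply 
Proposition 2.30 and obtain $n^{-\kappa} a_n + n^{-\kappa} b_n = n^{-\kappa}(a_n + b_n) \xrightarrow{p} 0$. 

(iii) We need to prove that for any $\varepsilon > 0$ and any $\delta > 0$ there exists $N_\varepsilon \in \mathbb{N}$ 
such that 

$$P(|n^{-\lambda-\mu} a_n b_n| > \delta) < \varepsilon$$

for all $n > N_\varepsilon$. By definition there exist $N_{a,\varepsilon} \in \mathbb{N}$ and $\Delta_{a,\varepsilon} < \infty$ 
such that $P(\omega : |n^{-\lambda} a_n| > \Delta_{a,\varepsilon}) < \varepsilon/2$ for all $n > N_{a,\varepsilon}$. Define 
$\delta_{b,\varepsilon} = \delta/\Delta_{a,\varepsilon} > 0$; then by the definition of convergence in probability, there 
exists $N_{b,\varepsilon}$ such that $P(|n^{-\mu} b_n| > \delta_{b,\varepsilon}) < \varepsilon/2$ for all $n > N_{b,\varepsilon}$. As in 
(i) we have that 

$$P(|n^{-\lambda-\mu} a_n b_n| > \delta) \le P(|n^{-\lambda} a_n| > \Delta_{a,\varepsilon}) + P(|n^{-\mu} b_n| > \delta_{b,\varepsilon}) < \varepsilon$$

for all $n > N_\varepsilon = \max(N_{a,\varepsilon}, N_{b,\varepsilon})$. Since this holds for arbitrary choices of 
$\varepsilon > 0$ and $\delta > 0$, we have shown that $a_n b_n = o_p(n^{\lambda+\mu})$. 

Consider $\{a_n + b_n\}$. Since $\{b_n\}$ is also $O_p(n^{\mu})$, it follows from (i) 
that $a_n + b_n = O_p(n^{\kappa})$. ∎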


Exercise 3.6 

Proof. We verify the conditions of Exercise 2.13. Given (ii), the elements 
of $\{Z_t \varepsilon_t\}$ and $\{Z_t X_t'\}$ are i.i.d. sequences by Proposition 3.3. The elements 
of $\{Z_t \varepsilon_t\}$ and $\{Z_t X_t'\}$ have finite expected absolute values given (iii.b) and 
(iv.a). By Theorem 3.1, 

$$Z'\varepsilon/n = n^{-1} \sum_{t=1}^{n} Z_t \varepsilon_t \xrightarrow{a.s.} 0$$

and 

$$Z'X/n = n^{-1} \sum_{t=1}^{n} Z_t X_t' \xrightarrow{a.s.} Q,$$

finite with full column rank. Since (iv.c) is also given, the conditions of 
Exercise 2.13 are satisfied and the result follows. ∎


Exercise 3.13 
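Proof. By Minkowski’s inequality, 

$$E\Big|\sum_{h=1}^{p} X_{thi} \varepsilon_{th}\Big|^{1+\delta} \le \Big[\sum_{h=1}^{p} \big(E|X_{thi} \varepsilon_{th}|^{1+\delta}\big)^{1/(1+\delta)}\Big]^{1+\delta} < \Big[\sum_{h=1}^{p} \Delta^{1/(1+\delta)}\Big]^{1+\delta} = p^{1+\delta} \Delta \equiv \Delta'.$$ ∎


Exercise 3.14 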

Proof. We verify the conditions of Theorem 2.18. By Proposition 3.10, 
$\{X_t \varepsilon_t\}$ and $\{X_t X_t'\}$ are independent sequences with elements satisfying 
the moment condition of Corollary 3.9 given (iii.b), (iv.a), and using the 
results of Corollary 3.12 and Exercise 3.13. It follows from Corollary 3.9 
that 

$$X'\varepsilon/n = n^{-1} \sum_{t=1}^{n} X_t \varepsilon_t \xrightarrow{a.s.} 0$$

and 

$$X'X/n - M_n = n^{-1} \sum_{t=1}^{n} X_t X_t' - M_n \xrightarrow{a.s.} 0.$$

$M_n = O(1)$ given (iv.a) as a consequence of Jensen’s inequality and the 
Cauchy-Schwarz inequality. To show this, consider the $i,j$th element of $M_n$, 

$$n^{-1} \sum_{t=1}^{n} \sum_{h=1}^{p} E(X_{thi} X_{thj}).$$

Now 

$$\Big|n^{-1} \sum_{t=1}^{n} \sum_{h=1}^{p} E(X_{thi} X_{thj})\Big| \le n^{-1} \sum_{t=1}^{n} \sum_{h=1}^{p} |E(X_{thi} X_{thj})|$$
$$\le n^{-1} \sum_{t=1}^{n} \sum_{h=1}^{p} E|X_{thi} X_{thj}|$$
$$\le n^{-1} \sum_{t=1}^{n} \sum_{h=1}^{p} \big(E|X_{thi}|^2 \, E|X_{thj}|^2\big)^{1/2}$$
$$< n^{-1} \sum_{t=1}^{n} \sum_{h=1}^{p} \Delta' = p\Delta' < \infty$$

for a finite constant $\Delta'$, given (iv.a). Hence, the conditions of Theorem 2.18 
are satisfied and the result follows. ∎


Exercise 3.38 

Proof. We verify the conditions of Exercise 2.13. Given (ii), $\{Z_t \varepsilon_t\}$ and 
$\{Z_t X_t'\}$ are stationary ergodic sequences by Proposition 3.36, with elements 
having finite expected absolute values given (iii.b) and (iv.a). By 
the ergodic theorem (Theorem 3.34), 

$$Z'\varepsilon/n = n^{-1} \sum_{t=1}^{n} Z_t \varepsilon_t \xrightarrow{a.s.} 0$$

and 

$$Z'X/n = n^{-1} \sum_{t=1}^{n} Z_t X_t' \xrightarrow{a.s.} Q,$$

finite with full column rank. Since (iv.c) is also given, the conditions of 
Exercise 2.13 are satisfied and the result follows. ∎


Exercise 3.51 

Proof. We verify the conditions of Theorem 2.18. Given (ii), $\{X_t \varepsilon_t\}$ and 
$\{X_t X_t'\}$ are mixing sequences with $\phi$ of size $-r/(2r-1)$, $r \ge 1$, or $\alpha$ of size 
$-r/(r-1)$, $r > 1$, by Proposition 3.50. Given (iii.b) and (iv.a) the elements 
of $\{X_t \varepsilon_t\}$ and $\{X_t X_t'\}$ satisfy the moment condition of Corollary 3.48 by 
Minkowski’s inequality and the Cauchy-Schwarz inequality. It follows that 
$X'\varepsilon/n \xrightarrow{a.s.} 0$ and $X'X/n - M_n \xrightarrow{a.s.} 0$. $M_n = O(1)$ by Jensen’s inequality 
given (iv.a). Hence the conditions of Theorem 2.18 are satisfied and the 
result follows. ∎


Exercise 3.53 
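Proof. (i) The following conditions are sufficient: 

(i) $Y_t = \alpha_o Y_{t-1} + \beta_o X_t + \varepsilon_t$, $|\alpha_o| < 1$, $|\beta_o| < \infty$; 
(ii) $\{(Y_t, X_t)\}$ is a mixing sequence with $\phi$ of size $-r/(2r-1)$, $r \ge 1$, or 
$\alpha$ of size $-r/(r-1)$, $r > 1$; 
(iii) (a) $E(X_{t-j} \varepsilon_t) = 0$, $j = 0, 1, \ldots$ and all $t$; 
(b) $E(\varepsilon_t \varepsilon_{t-j}) = 0$, $j = 1, 2, \ldots$ and all $t$; 
(iv) (a) $E|X_t^2|^{r+\delta} < \Delta < \infty$ and $E|\varepsilon_t^2|^{r+\delta} < \Delta < \infty$ for some $0 < \delta < r$ 
and all $t$; 
(b) $M_n \equiv E(\mathbf{X}'\mathbf{X}/n)$ has $\det(M_n) > \gamma > 0$ for all $n$ sufficiently large, 
where $\mathbf{X}_t \equiv (Y_{t-1}, X_t)'$. 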
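First, we verify the hint (see Laha and Rohatgi, 1979, p. 53). We are 
given that 

$$\sum_{t=1}^{\infty} \big(E|Z_t|^p\big)^{1/p} < \infty.$$

By Minkowski’s inequality for finite sums, 

$$E\Big(\sum_{t=1}^{n} |Z_t|\Big)^p \le \Big(\sum_{t=1}^{n} \big(E|Z_t|^p\big)^{1/p}\Big)^p$$

for all $n \ge 1$. Hence 

$$\lim_{n \to \infty} E\Big(\sum_{t=1}^{n} |Z_t|\Big)^p \le \lim_{n \to \infty} \Big(\sum_{t=1}^{n} \big(E|Z_t|^p\big)^{1/p}\Big)^p = \Big(\sum_{t=1}^{\infty} \big(E|Z_t|^p\big)^{1/p}\Big)^p$$

by continuity of the function $g(x) = x^p$. Consequently, 

$$E\Big|\sum_{t=1}^{\infty} Z_t\Big|^p = E \lim_{n \to \infty} \Big|\sum_{t=1}^{n} Z_t\Big|^p \le E \lim_{n \to \infty} \Big(\sum_{t=1}^{n} |Z_t|\Big)^p = \lim_{n \to \infty} E\Big(\sum_{t=1}^{n} |Z_t|\Big)^p \le \Big(\sum_{t=1}^{\infty} \big(E|Z_t|^p\big)^{1/p}\Big)^p,$$

where we have applied the monotone convergence theorem to the function 
$f_n = [\sum_{t=1}^{n} |Z_t|]^p$ with limit $f = [\sum_{t=1}^{\infty} |Z_t|]^p$ ($0 \le f_1 \le f_2 \le \cdots \le f_n \to f$). 
Thus we obtain the desired result. 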

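Next, we verify the conditions of Exercise 3.51. First, $\{(Y_t, X_t)\}$ mixing 
implies $\{(\mathbf{X}_t, \varepsilon_t)\} = \{(Y_{t-1}, X_t, \varepsilon_t)\}$ is mixing and of the same sizes, given 
(i) and (ii) by Theorem 3.49. 

Next, by repeated substitution we can write $Y_t$ as 

$$Y_t = \beta_o \sum_{j=0}^{\infty} \alpha_o^j X_{t-j} + \sum_{j=0}^{\infty} \alpha_o^j \varepsilon_{t-j},$$

so that 

$$Y_{t-1} \varepsilon_t = \beta_o \sum_{j=0}^{\infty} \alpha_o^j X_{t-j-1} \varepsilon_t + \sum_{j=0}^{\infty} \alpha_o^j \varepsilon_{t-j-1} \varepsilon_t.$$

Consider $Z_t = |\beta_o| \sum_{j=0}^{\infty} |\alpha_o|^j |X_{t-j-1} \varepsilon_t| + \sum_{j=0}^{\infty} |\alpha_o|^j |\varepsilon_{t-j-1} \varepsilon_t|$. Since 
$E|X_{t-j-1} \varepsilon_t| < \Delta_{x\varepsilon} < \infty$ and $E|\varepsilon_{t-j-1} \varepsilon_t| < \Delta_{\varepsilon} < \infty$ (apply 
the Cauchy-Schwarz inequality and (iv)), we have that 

$$E(Z_t) \le \frac{|\beta_o|}{1 - |\alpha_o|} \Delta_{x\varepsilon} + \frac{1}{1 - |\alpha_o|} \Delta_{\varepsilon} < \infty.$$

So by Proposition 3.52 we can interchange the summation 
and expectation operators. Hence 

$$E(Y_{t-1} \varepsilon_t) = \beta_o \sum_{j=0}^{\infty} \alpha_o^j E(X_{t-j-1} \varepsilon_t) + \sum_{j=0}^{\infty} \alpha_o^j E(\varepsilon_{t-j-1} \varepsilon_t) = 0,$$

given (iii.a) and (iii.b). Therefore 

$$E(\mathbf{X}_t \varepsilon_t) = (E(Y_{t-1} \varepsilon_t), E(X_t \varepsilon_t))' = 0,$$

so that condition (iii.a) of Exercise 3.51 is satisfied. 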
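Now consider condition (iii.b). By the Cauchy-Schwarz inequality, 

$$E|X_t \varepsilon_t|^{r+\delta} \le \big(E|X_t^2|^{r+\delta} \, E|\varepsilon_t^2|^{r+\delta}\big)^{1/2} < \Delta < \infty$$

given (iv.a). Further, 

$$E|Y_{t-1} \varepsilon_t|^{r+\delta} \le \big(E|Y_{t-1}^2|^{r+\delta} \, E|\varepsilon_t^2|^{r+\delta}\big)^{1/2} < (\Delta' \Delta)^{1/2} < \infty,$$

provided $E|Y_{t-1}^2|^{r+\delta} < \Delta' < \infty$ for some $\Delta'$. To show this, we write $Y_t$ as 
above and apply Minkowski’s inequality: 

$$E|Y_t^2|^{r+\delta} = E\Big|\beta_o \sum_{j=0}^{\infty} \alpha_o^j X_{t-j} + \sum_{j=0}^{\infty} \alpha_o^j \varepsilon_{t-j}\Big|^{2(r+\delta)}$$
$$\le \Big[\sum_{j=0}^{\infty} \Big(\big(E|\beta_o \alpha_o^j X_{t-j}|^{2(r+\delta)}\big)^{\frac{1}{2(r+\delta)}} + \big(E|\alpha_o^j \varepsilon_{t-j}|^{2(r+\delta)}\big)^{\frac{1}{2(r+\delta)}}\Big)\Big]^{2(r+\delta)}$$
$$= \Big[\sum_{j=0}^{\infty} |\alpha_o|^j \Big(|\beta_o| \big(E|X_{t-j}|^{2(r+\delta)}\big)^{\frac{1}{2(r+\delta)}} + \big(E|\varepsilon_{t-j}|^{2(r+\delta)}\big)^{\frac{1}{2(r+\delta)}}\Big)\Big]^{2(r+\delta)}$$
$$< \Big[\sum_{j=0}^{\infty} |\alpha_o|^j \big(|\beta_o| + 1\big) \Delta^{\frac{1}{2(r+\delta)}}\Big]^{2(r+\delta)}$$
$$= \Big[\frac{(|\beta_o| + 1) \Delta^{\frac{1}{2(r+\delta)}}}{1 - |\alpha_o|}\Big]^{2(r+\delta)} < \infty$$

if and only if $|\alpha_o| < 1$, where we have again used Proposition 3.52 to pass 
the expectation operator through the summation operator. Therefore (i) 
and (iv.a) ensure that (iii.b) is satisfied. We now see that all the conditions 
of Exercise 3.51 hold, so that the OLS estimate of $(\alpha_o, \beta_o)$ is consistent. 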
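(ii) Consider the following data generating process: 

$$Y_t = \alpha_o Y_{t-1} + \varepsilon_t,$$

$$\varepsilon_t = \rho_o \varepsilon_{t-1} + u_t,$$

where we set $\beta_o = 0$ and assume $E(Y_{t-1} u_t) = 0$, $E(Y_{t-1} \varepsilon_{t-1}) = E(Y_t \varepsilon_t)$, 
and $E(\varepsilon_t^2) = \mathrm{var}(\varepsilon_t) = \sigma_o^2$. Then from Chapter 1 we know that 

$$E(Y_{t-1} \varepsilon_t) = \sigma_o^2 \rho_o / (1 - \rho_o \alpha_o).$$

Therefore, if $\sigma_o^2 \ne 0$ and $\rho_o \ne 0$, condition (iii.a) of Exercise 3.51 is violated. 
∎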
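
The failure of condition (iii.a) in part (ii) can be seen in a small simulation. The sketch below generates $Y_t = \alpha_o Y_{t-1} + \varepsilon_t$ with AR(1) errors $\varepsilon_t = \rho_o \varepsilon_{t-1} + u_t$, then reports the OLS slope from regressing $Y_t$ on $Y_{t-1}$ together with the sample analog of $E(Y_{t-1} \varepsilon_t)$. Standard normal $u_t$, zero starting values, and the function name are assumptions made for illustration.

```python
import random


def simulate_lagged_ols(n, alpha, rho, seed=0):
    """Simulate the DGP of part (ii) and return (OLS slope, mean of Y_{t-1} * eps_t)."""
    rng = random.Random(seed)
    y_prev = 0.0
    eps_prev = 0.0
    sum_yy = 0.0   # sum of Y_{t-1}^2
    sum_yy1 = 0.0  # sum of Y_{t-1} * Y_t
    sum_ye = 0.0   # sum of Y_{t-1} * eps_t
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        eps = rho * eps_prev + u
        y = alpha * y_prev + eps
        sum_yy += y_prev * y_prev
        sum_yy1 += y_prev * y
        sum_ye += y_prev * eps
        y_prev, eps_prev = y, eps
    return sum_yy1 / sum_yy, sum_ye / n


alpha_o, rho_o = 0.5, 0.5
slope, cross_moment = simulate_lagged_ols(100_000, alpha_o, rho_o)
sigma2_eps = 1.0 / (1.0 - rho_o ** 2)                  # var(eps_t) when var(u_t) = 1
implied = sigma2_eps * rho_o / (1.0 - rho_o * alpha_o)  # formula from part (ii)
print(slope, cross_moment, implied)
```

With $\rho_o = 0.5$ the slope estimate settles well above $\alpha_o = 0.5$, whereas setting `rho=0.0` restores an estimate close to $\alpha_o$.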
Exercise 3.77 


Proof. We verify the moment condition of Theorem 3.76. Since $E|Z_t|^{2r} < \Delta < \infty$ 
for all $t$, it follows that 

$$\sum_{t=1}^{\infty} E|Z_t|^{2r} / t^{1+r} < \sum_{t=1}^{\infty} \Delta / t^{1+r} = \Delta \sum_{t=1}^{\infty} t^{-(1+r)} < \infty,$$

since $\sum_{t=1}^{\infty} t^{-(1+r)} < \infty$ for any $r > 0$. The result follows from Theorem 
3.76. ∎


Exercise 3.79 

Proof. We verify the conditions of Exercise 2.20. First, note that 
$Z'\varepsilon/n = n^{-1} \sum_{h=1}^{p} Z_h' \varepsilon_h$, where $Z_h$ is the $n \times l$ matrix with rows $Z_{th}$ and $\varepsilon_h$ is the 
$n \times 1$ error vector with elements $\varepsilon_{th}$. By assumption (iii.a), $\{Z_{thi} \varepsilon_{th}, \mathcal{F}_t\}$ 
is a martingale difference sequence. Given (iii.b), the moment conditions 
of Exercise 3.77 are satisfied so that $n^{-1} \sum_{t=1}^{n} Z_{thi} \varepsilon_{th} \xrightarrow{a.s.} 0$, $h = 1, \ldots, p$, 
$i = 1, \ldots, l$, and therefore $Z'\varepsilon/n \xrightarrow{a.s.} 0$ by Proposition 2.11. 

Next, Proposition 3.50 ensures that $\{Z_t X_t'\}$ is a mixing sequence given 
(ii), which satisfies the conditions of Corollary 3.48 given (iv.a). It follows 
from Corollary 3.48 that $Z'X/n - Q_n \xrightarrow{a.s.} 0$, and $Q_n = O(1)$ given (iv.a) 
by Jensen’s inequality. Hence the conditions of Exercise 2.20 are satisfied 
and the result follows. ∎


Exercise 4.18 

Proof. Let $V$ be $k \times k$ with eigenvalues $\lambda_1, \ldots, \lambda_k$. Since $V$ is real and 
symmetric it can be diagonalized by $V = Q'DQ$, where $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ 
is the matrix with the eigenvalues of $V$ along its diagonal and zeros 
elsewhere, and $Q$ is an orthogonal matrix that has as its rows the standardized 
eigenvectors of $V$ corresponding to $\lambda_1, \ldots, \lambda_k$. Furthermore, since $V$ is 
positive (semi) definite its eigenvalues satisfy $\lambda_i > 0$ ($\lambda_i \ge 0$), $i = 1, \ldots, k$. 
Hence, defining $D^{1/2} = \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_k^{1/2})$, we can define the square root 
of $V$ as 

$$V^{1/2} = Q'D^{1/2}Q,$$

which is clearly positive (semi) definite. From 

$$(V^{1/2})' = Q'(D^{1/2})'(Q')' = V^{1/2}$$

we see that $V^{1/2}$ is symmetric, and we verify that 

$$V^{1/2}V^{1/2} = Q'D^{1/2}QQ'D^{1/2}Q = Q'D^{1/2}D^{1/2}Q = V,$$

where we used the fact that $Q$ is orthogonal. 
The mapping $V \to (Q, D)$ is continuous because for any matrix $H_n$ with 
$\lim_{n \to \infty} H_n = 0$, it holds that $\lim_{n \to \infty} Q(V + H_n)Q' = D$, thus 

$$\lim_{n \to \infty} (V + H_n) = Q'DQ.$$

Since also $\lambda \to \sqrt{\lambda}$ is continuous ($\lambda > 0$), it follows that $V \to V^{1/2}$ is 
continuous. ∎
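
The construction $V^{1/2} = Q'D^{1/2}Q$ can be checked numerically. The sketch below does so for the $2 \times 2$ case only, where the eigenvalues and eigenvectors of a symmetric matrix are available in closed form; the restriction to $2 \times 2$ and the function names are choices made to keep the example free of external libraries.

```python
import math


def sqrt_psd_2x2(v):
    """Return V^{1/2} = sum_i sqrt(lambda_i) q_i q_i' for a symmetric PSD 2x2 matrix."""
    (a, b), (b2, c) = v
    if abs(b - b2) > 1e-12:
        raise ValueError("V must be symmetric")
    center = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    eigenvalues = (center + radius, center - radius)
    if eigenvalues[1] < -1e-12:
        raise ValueError("V must be positive semidefinite")
    if b == 0.0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        eigenvectors = ((1.0, 0.0), (0.0, 1.0)) if a >= c else ((0.0, 1.0), (1.0, 0.0))
    else:
        eigenvectors = []
        for lam in eigenvalues:
            vx, vy = b, lam - a  # solves (V - lam I) q = 0
            norm = math.hypot(vx, vy)
            eigenvectors.append((vx / norm, vy / norm))
    root = [[0.0, 0.0], [0.0, 0.0]]
    for lam, (vx, vy) in zip(eigenvalues, eigenvectors):
        s = math.sqrt(max(lam, 0.0))
        root[0][0] += s * vx * vx
        root[0][1] += s * vx * vy
        root[1][0] += s * vy * vx
        root[1][1] += s * vy * vy
    return root


def matmul_2x2(p, q):
    """Plain 2x2 matrix product."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


V = [[2.0, 1.0], [1.0, 2.0]]
R = sqrt_psd_2x2(V)
print(R)
print(matmul_2x2(R, R))  # reproduces V up to rounding
```

The result is symmetric by construction, and squaring it recovers $V$, mirroring the two verifications in the proof.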


Exercise 4.19 
Proof. If $Z \sim N(0, V)$ it follows from Example 4.12 that 

$$V^{-1/2} Z \sim N(0, I),$$

since $V^{-1/2} V V^{-1/2} = I$. ∎


Exercise 4.26 
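Proof. Since $Z'X/n - Q_n \xrightarrow{p} 0$, where $Q_n$ is finite and has full column 
rank for all $n$ sufficiently large, and $\hat{P}_n - P_n \xrightarrow{p} 0$, where $P_n$ is finite and 
nonsingular for all $n$ sufficiently large, it follows from Proposition 2.30 that 

$$X'Z\hat{P}_n Z'X/n^2 - Q_n'P_nQ_n \xrightarrow{p} 0.$$

Also since $Q_n'P_nQ_n$ is nonsingular for all $n$ sufficiently large by Lemma 
2.19 given (iii), $(X'Z\hat{P}_n Z'X/n^2)^{-1}$ and $\tilde{\beta}_n$ exist in probability. So given 
(i) 

$$\sqrt{n}(\tilde{\beta}_n - \beta_o) = (X'Z\hat{P}_n Z'X/n^2)^{-1} (X'Z/n) \hat{P}_n n^{-1/2} Z'\varepsilon.$$

Hence, given (ii), 

$$\sqrt{n}(\tilde{\beta}_n - \beta_o) - (Q_n'P_nQ_n)^{-1} Q_n'P_n n^{-1/2} Z'\varepsilon$$
$$= \big[(X'Z\hat{P}_n Z'X/n^2)^{-1} (X'Z/n) \hat{P}_n - (Q_n'P_nQ_n)^{-1} Q_n'P_n\big] V_n^{1/2} V_n^{-1/2} n^{-1/2} Z'\varepsilon.$$

Premultiplying by $D_n^{-1/2}$ yields 

$$D_n^{-1/2} \sqrt{n}(\tilde{\beta}_n - \beta_o) - D_n^{-1/2} (Q_n'P_nQ_n)^{-1} Q_n'P_n n^{-1/2} Z'\varepsilon$$
$$= D_n^{-1/2} \big[(X'Z\hat{P}_n Z'X/n^2)^{-1} (X'Z/n) \hat{P}_n - (Q_n'P_nQ_n)^{-1} Q_n'P_n\big] V_n^{1/2} V_n^{-1/2} n^{-1/2} Z'\varepsilon.$$

Now $V_n^{-1/2} n^{-1/2} Z'\varepsilon \overset{A}{\sim} N(0, I)$ given (ii) and 

$$D_n^{-1/2} \big[(X'Z\hat{P}_n Z'X/n^2)^{-1} (X'Z/n) \hat{P}_n - (Q_n'P_nQ_n)^{-1} Q_n'P_n\big] V_n^{1/2} = o_p(1)$$

since $D_n^{-1/2} = O(1)$ and $V_n^{1/2} = O(1)$ given (ii) and (iii), and 

$$(X'Z\hat{P}_n Z'X/n^2)^{-1} (X'Z/n) \hat{P}_n - (Q_n'P_nQ_n)^{-1} Q_n'P_n = o_p(1),$$

given (iii) by Proposition 2.30. Hence, by Lemma 4.6, 

$$D_n^{-1/2} \sqrt{n}(\tilde{\beta}_n - \beta_o) - D_n^{-1/2} (Q_n'P_nQ_n)^{-1} Q_n'P_n n^{-1/2} Z'\varepsilon \xrightarrow{p} 0.$$

By Lemma 4.7, $D_n^{-1/2} \sqrt{n}(\tilde{\beta}_n - \beta_o)$ has the same limiting distribution as 
$D_n^{-1/2} (Q_n'P_nQ_n)^{-1} Q_n'P_n n^{-1/2} Z'\varepsilon$. 

We find the asymptotic distribution of this random vector by applying 
Corollary 4.24 with $A_n' = (Q_n'P_nQ_n)^{-1} Q_n'P_n$ and $\Gamma_n = D_n$, which 
immediately yields 

$$D_n^{-1/2} (Q_n'P_nQ_n)^{-1} Q_n'P_n n^{-1/2} Z'\varepsilon \overset{A}{\sim} N(0, I).$$

Since (ii), (iii) and (iv) hold, $\hat{D}_n - D_n \xrightarrow{p} 0$ as an immediate consequence 
of Proposition 2.30. ∎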


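Exercise 4.33 
Proof. Given that $\hat{V}_n - V_n \xrightarrow{p} 0$ and $\ddot{V}_n - V_n \xrightarrow{p} 0$, it follows from 
Proposition 2.30 that $\hat{V}_n - \ddot{V}_n = (\hat{V}_n - V_n) - (\ddot{V}_n - V_n) \xrightarrow{p} 0$. It 
immediately follows from Proposition 2.30 that $\mathcal{W}_n - \mathcal{LM}_n \xrightarrow{p} 0$. ∎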


Exercise 4.34 
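Proof. From the solution to the constrained minimization problem we 
know that 

$$\ddot{\lambda}_n = 2(R(X'X/n)^{-1}R')^{-1}(R\hat{\beta}_n - r)$$

and applying the hint, 

$$\ddot{\lambda}_n = 2(R(X'X/n)^{-1}R')^{-1} R(X'X/n)^{-1} X'(Y - X\ddot{\beta}_n)/n.$$

Now $Y - X\ddot{\beta}_n = Y - X_1\ddot{\beta}_{1n} - X_2\ddot{\beta}_{2n} = Y - X_1\ddot{\beta}_{1n} = \ddot{\varepsilon}$, so that 

$$\ddot{\lambda}_n = 2(R(X'X)^{-1}R')^{-1} R(X'X)^{-1} X'\ddot{\varepsilon}/n.$$

Partitioning $R$ as $[0, I_q]$ and $X'X$ as 

$$X'X = \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}$$

and applying the formula for a partitioned inverse gives 

$$R(X'X)^{-1}R' = (X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2)^{-1}$$

and 

$$R(X'X)^{-1}X' = (X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2)^{-1} X_2'(I - X_1(X_1'X_1)^{-1}X_1').$$

Hence by substitution 

$$\ddot{\lambda}_n = 2X_2'(I - X_1(X_1'X_1)^{-1}X_1')\ddot{\varepsilon}/n = 2X_2'\ddot{\varepsilon}/n,$$

since $\ddot{\varepsilon} = (I - X_1(X_1'X_1)^{-1}X_1')Y$ and $I - X_1(X_1'X_1)^{-1}X_1'$ is idempotent. 
∎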


Exercise 4.35 
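Proof. Substituting $\ddot{V}_n = \ddot{\sigma}_n^2 (X'X/n)$, with $\ddot{\sigma}_n^2 = \ddot{\varepsilon}'\ddot{\varepsilon}/n$, into the Lagrange 
multiplier statistic of Theorem 4.32 yields 

$$\mathcal{LM}_n = n \ddot{\lambda}_n' R[(\ddot{\varepsilon}'\ddot{\varepsilon}/n)(X'X/n)]^{-1} R' \ddot{\lambda}_n / 4.$$

From Exercise 4.34, $\ddot{\lambda}_n = 2X_2'\ddot{\varepsilon}/n$ under $H_o : \beta_2 = 0$. Substituting this 
into the above expression and rearranging gives 

$$\mathcal{LM}_n = n \ddot{\varepsilon}'X_2 R (X'X)^{-1} R' X_2'\ddot{\varepsilon} / (\ddot{\varepsilon}'\ddot{\varepsilon}).$$

Recalling that $X_2 R = (0, X_2)$ and $\ddot{\varepsilon}'X_1 = 0$ we can write 

$$\ddot{\varepsilon}'X_2 R = \ddot{\varepsilon}'(0, X_2) = \ddot{\varepsilon}'(X_1, X_2) = \ddot{\varepsilon}'X,$$

which upon substitution immediately yields the result. ∎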
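
For a concrete reading of this statistic, the sketch below computes $n \ddot{\varepsilon}'X(X'X)^{-1}X'\ddot{\varepsilon}/(\ddot{\varepsilon}'\ddot{\varepsilon})$ in the simplest setting: $X_1$ is a column of ones and $X_2$ is a single regressor. Under that assumption the constrained residuals are just the demeaned $Y_t$, and the ratio equals $n$ times the squared sample correlation between $Y_t$ and the regressor. The function name is hypothetical.

```python
def lm_statistic(y, x):
    """n * e'X(X'X)^{-1}X'e / (e'e) with X = (1, x) and e = residuals of y on a constant.

    Because e has mean zero, projecting e on (1, x) is the same as
    projecting it on the demeaned regressor alone.
    """
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    e = [yi - y_bar for yi in y]    # constrained residuals under beta_2 = 0
    xd = [xi - x_bar for xi in x]   # regressor net of the constant
    s_xe = sum(a * b for a, b in zip(xd, e))
    s_xx = sum(a * a for a in xd)
    s_ee = sum(b * b for b in e)
    return n * s_xe * s_xe / (s_xx * s_ee)


x = [1.0, 2.0, 3.0, 4.0]
print(lm_statistic([1.0, 2.0, 3.0, 4.0], x))    # perfect fit
print(lm_statistic([1.0, -1.0, 1.0, -1.0], x))  # weak relation
```

A perfect linear relation gives the largest possible value, $n$, while a regressor that explains little of $\ddot{\varepsilon}$ gives a value near zero.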
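Exercise 4.40 
Proof. We are given that $s(\beta) = \beta_3 - \beta_1 \beta_2$. Hence $\nabla s(\beta) = (-\beta_2, -\beta_1, 1)'$. 
Substituting $s(\hat{\beta}_n)$ and $\nabla s(\hat{\beta}_n)$ into the Wald statistic of Theorem 4.39 
yields 

$$\mathcal{W}_n = n(\hat{\beta}_{3n} - \hat{\beta}_{1n} \hat{\beta}_{2n})^2 / \hat{\Gamma}_n,$$

where 

$$\hat{\Gamma}_n = (-\hat{\beta}_{2n}, -\hat{\beta}_{1n}, 1)(X'X/n)^{-1} \hat{V}_n (X'X/n)^{-1} (-\hat{\beta}_{2n}, -\hat{\beta}_{1n}, 1)'.$$

Note that $\mathcal{W}_n \overset{A}{\sim} \chi_1^2$ in this case. ∎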
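
The Wald statistic for the restriction $\beta_3 = \beta_1 \beta_2$ is simple enough to compute by hand. The sketch below takes the estimate $\hat{\beta}_n$, the sample size, and an already computed $3 \times 3$ matrix standing for $(X'X/n)^{-1} \hat{V}_n (X'X/n)^{-1}$; how that matrix is obtained is outside the example, and the argument name `d_hat` is an assumption made here.

```python
def wald_product_restriction(beta_hat, d_hat, n):
    """Wald statistic n * s(b)^2 / Gamma for s(b) = b3 - b1 * b2.

    d_hat is a 3x3 nested list standing for (X'X/n)^{-1} V_hat (X'X/n)^{-1}.
    """
    b1, b2, b3 = beta_hat
    s = b3 - b1 * b2
    grad = (-b2, -b1, 1.0)  # gradient of s evaluated at beta_hat
    gamma = sum(grad[i] * d_hat[i][j] * grad[j] for i in range(3) for j in range(3))
    return n * s * s / gamma


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# s = 2.5 - 1 * 2 = 0.5, gradient = (-2, -1, 1), Gamma = 4 + 1 + 1 = 6
print(wald_product_restriction((1.0, 2.0, 2.5), identity, 100))
```

The value would then be compared with a critical value from the chi-square distribution with one degree of freedom.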


Exercise 4.41 
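Proof. The Lagrange multiplier statistic is motivated by the constrained 
minimization problem 

$$\min_{\beta} (Y - X\beta)'(Y - X\beta)/n \quad \text{s.t. } s(\beta) = 0.$$

The Lagrangian for the problem is 

$$\mathcal{L} = (Y - X\beta)'(Y - X\beta)/n + s(\beta)'\lambda$$

and the first order conditions are 

$$\partial \mathcal{L}/\partial \beta = 2(X'X/n)\beta - 2X'Y/n + \nabla s(\beta)\lambda = 0$$
$$\partial \mathcal{L}/\partial \lambda = s(\beta) = 0.$$

Setting $\hat{\beta}_n = (X'X/n)^{-1}X'Y/n$ and taking a mean value expansion of 
$s(\beta)$ around $\hat{\beta}_n$ gives 

$$\partial \mathcal{L}/\partial \beta = 2(X'X/n)(\beta - \hat{\beta}_n) + \nabla s(\beta)\lambda = 0$$
$$\partial \mathcal{L}/\partial \lambda = s(\hat{\beta}_n) + \nabla \bar{s}'(\beta - \hat{\beta}_n) = 0,$$

where $\nabla \bar{s}'$ is the $q \times k$ Jacobian of $s$ with $i$th row evaluated at a mean value 
$\bar{\beta}_n^{(i)}$. Premultiplying the first equation by $\nabla \bar{s}'(X'X/n)^{-1}$ and substituting 
$-s(\hat{\beta}_n) = \nabla \bar{s}'(\beta - \hat{\beta}_n)$ gives 

$$\ddot{\lambda}_n = 2[\nabla \bar{s}'(X'X/n)^{-1} \nabla s(\ddot{\beta}_n)]^{-1} s(\hat{\beta}_n).$$

Thus, following the procedures of Theorems 4.32 and 4.39 we could propose 
the statistic 

$$\mathcal{LM}_n = n \ddot{\lambda}_n' \tilde{\Lambda}_n^{-1} \ddot{\lambda}_n,$$

where 

$$\tilde{\Lambda}_n = 4(\nabla \bar{s}'(X'X/n)^{-1} \nabla s(\ddot{\beta}_n))^{-1} \nabla s(\hat{\beta}_n)'(X'X/n)^{-1} \ddot{V}_n (X'X/n)^{-1} \nabla s(\hat{\beta}_n) (\nabla s(\ddot{\beta}_n)'(X'X/n)^{-1} \nabla \bar{s})^{-1}.$$

The preceding statistic, however, is not very useful because it depends on 
generally unknown mean values and also on the unconstrained estimate $\hat{\beta}_n$. 
An asymptotically equivalent statistic replaces $\nabla \bar{s}'$ by $\nabla s(\ddot{\beta}_n)'$ and $\nabla s(\hat{\beta}_n)$ by 
$\nabla s(\ddot{\beta}_n)$: 

$$\mathcal{LM}_n = n \ddot{\lambda}_n' \ddot{\Lambda}_n^{-1} \ddot{\lambda}_n,$$

where 

$$\ddot{\lambda}_n = 2[\nabla s(\ddot{\beta}_n)'(X'X/n)^{-1} \nabla s(\ddot{\beta}_n)]^{-1} s(\hat{\beta}_n)$$

$$\ddot{\Lambda}_n = 4(\nabla s(\ddot{\beta}_n)'(X'X/n)^{-1} \nabla s(\ddot{\beta}_n))^{-1} \nabla s(\ddot{\beta}_n)'(X'X/n)^{-1} \ddot{V}_n (X'X/n)^{-1} \nabla s(\ddot{\beta}_n) (\nabla s(\ddot{\beta}_n)'(X'X/n)^{-1} \nabla s(\ddot{\beta}_n))^{-1}.$$

To show that $\mathcal{LM}_n \overset{A}{\sim} \chi_q^2$ under $H_o$ we note that $\mathcal{LM}_n$ differs from $\mathcal{W}_n$ only 
in that $\ddot{V}_n$ is used in place of $\hat{V}_n$ and $\ddot{\beta}_n$ replaces $\hat{\beta}_n$. Since $\ddot{\beta}_n - \hat{\beta}_n \xrightarrow{p} 0$ 
and $\ddot{V}_n - \hat{V}_n \xrightarrow{p} 0$ under $H_o$ given the conditions of Theorem 4.25, it 
follows from Proposition 2.30 that 

$$\mathcal{LM}_n - \mathcal{W}_n \xrightarrow{p} 0,$$

given that $\nabla s$ is continuous. Hence $\mathcal{LM}_n \overset{A}{\sim} \chi_q^2$ by Lemma 4.7. ∎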


Exercise 4.42 
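Proof. First consider testing the hypothesis $R\beta_o = r$. Analogous to 
Theorem 4.31 the Wald statistic is 

$$\mathcal{W}_n = n(R\tilde{\beta}_n - r)' \hat{\Gamma}_n^{-1} (R\tilde{\beta}_n - r) \overset{A}{\sim} \chi_q^2$$

under $H_o$, where 

$$\hat{\Gamma}_n = R\hat{D}_n R' = R(X'Z\hat{P}_n Z'X/n^2)^{-1} (X'Z/n) \hat{P}_n \hat{V}_n \hat{P}_n (Z'X/n) (X'Z\hat{P}_n Z'X/n^2)^{-1} R'.$$

To prove $\mathcal{W}_n$ has an asymptotic $\chi_q^2$ distribution under $H_o$, we note that 

$$R\tilde{\beta}_n - r = R(\tilde{\beta}_n - \beta_o),$$

so 

$$\Gamma_n^{-1/2} \sqrt{n}(R\tilde{\beta}_n - r) = \Gamma_n^{-1/2} R \sqrt{n}(\tilde{\beta}_n - \beta_o),$$

where 

$$\Gamma_n = RD_nR' = R(Q_n'P_nQ_n)^{-1} Q_n'P_n V_n P_n Q_n (Q_n'P_nQ_n)^{-1} R'.$$

It follows from Corollary 4.24 that $\Gamma_n^{-1/2} R \sqrt{n}(\tilde{\beta}_n - \beta_o) \overset{A}{\sim} N(0, I)$, so that 
$\Gamma_n^{-1/2} \sqrt{n}(R\tilde{\beta}_n - r) \overset{A}{\sim} N(0, I)$. Since $\hat{D}_n - D_n \xrightarrow{p} 0$ from Exercise 4.26, 
it follows that $\hat{\Gamma}_n - \Gamma_n \xrightarrow{p} 0$ by Proposition 2.30. Hence $\mathcal{W}_n \overset{A}{\sim} \chi_q^2$ by 
Theorem 4.30. 

We can derive the Lagrange multiplier statistic from the constrained 
minimization problem: 

$$\min_{\beta} (Y - X\beta)'Z\hat{P}_n Z'(Y - X\beta)/n^2 \quad \text{s.t. } R\beta = r.$$

The first-order conditions are 

$$\partial \mathcal{L}/\partial \beta = 2(X'Z\hat{P}_n Z'X/n^2)\beta - 2(X'Z/n)\hat{P}_n Z'Y/n + R'\lambda = 0$$
$$\partial \mathcal{L}/\partial \lambda = R\beta - r = 0,$$

where $\lambda$ is the vector of Lagrange multipliers. It follows that 

$$\ddot{\lambda}_n = 2(R(X'Z\hat{P}_n Z'X/n^2)^{-1}R')^{-1}(R\tilde{\beta}_n - r)$$

$$\ddot{\beta}_n = \tilde{\beta}_n - (X'Z\hat{P}_n Z'X/n^2)^{-1} R'\ddot{\lambda}_n/2.$$

Hence, analogous to Theorem 4.32, $\mathcal{LM}_n = n \ddot{\lambda}_n' \ddot{\Lambda}_n^{-1} \ddot{\lambda}_n \overset{A}{\sim} \chi_q^2$ under $H_o$, 
where 

$$\ddot{\Lambda}_n = 4(R(X'Z\hat{P}_n Z'X/n^2)^{-1}R')^{-1} R(X'Z\hat{P}_n Z'X/n^2)^{-1} (X'Z/n) \hat{P}_n \ddot{V}_n \hat{P}_n (Z'X/n) (X'Z\hat{P}_n Z'X/n^2)^{-1} R' (R(X'Z\hat{P}_n Z'X/n^2)^{-1}R')^{-1}$$

and $\ddot{V}_n$ is computed under the constrained regression such that 
$\ddot{V}_n - V_n \xrightarrow{p} 0$ under $H_o$. If we can show that $\mathcal{LM}_n - \mathcal{W}_n \xrightarrow{p} 0$, then we can 
apply Lemma 4.7 to conclude $\mathcal{LM}_n \overset{A}{\sim} \chi_q^2$. Note that $\mathcal{LM}_n$ differs from 
$\mathcal{W}_n$ in that $\ddot{V}_n$ is used in place of $\hat{V}_n$. Since $\ddot{V}_n - \hat{V}_n \xrightarrow{p} 0$ under $H_o$, it 
follows from Proposition 2.30 that $\mathcal{LM}_n - \mathcal{W}_n \xrightarrow{p} 0$. 

Next, consider the nonlinear hypothesis $s(\beta_o) = 0$. The Wald statistic is 
easily seen to be 

$$\mathcal{W}_n = n s(\tilde{\beta}_n)' \hat{\Gamma}_n^{-1} s(\tilde{\beta}_n),$$

where 

$$\hat{\Gamma}_n = \nabla s(\tilde{\beta}_n)' \hat{D}_n \nabla s(\tilde{\beta}_n)$$

and $\hat{D}_n$ is given in Exercise 4.26. The proof that $\mathcal{W}_n \overset{A}{\sim} \chi_q^2$ under $H_o$ 
is identical to that of Theorem 4.39 except that $\tilde{\beta}_n$ replaces $\hat{\beta}_n$, $\hat{D}_n$ is 
appropriately defined, and the results of Exercise 4.26 are used in place of 
those of Theorem 4.25. 

The Lagrange multiplier statistic can be derived in a manner analogous 
to Exercise 4.41, and the result is that the Lagrange multiplier statistic has 
the form of the Wald statistic with the constrained estimates $\ddot{\beta}_n$ and $\ddot{V}_n$ 
replacing $\tilde{\beta}_n$ and $\hat{V}_n$. Thus 

$$\mathcal{LM}_n = n \ddot{\lambda}_n' \ddot{\Lambda}_n^{-1} \ddot{\lambda}_n,$$

where 

$$\ddot{\lambda}_n = 2[\nabla s(\ddot{\beta}_n)'(X'Z\hat{P}_n Z'X/n^2)^{-1} \nabla s(\ddot{\beta}_n)]^{-1} s(\tilde{\beta}_n)$$

$$\ddot{\Lambda}_n = 4[\nabla s(\ddot{\beta}_n)'(X'Z\hat{P}_n Z'X/n^2)^{-1} \nabla s(\ddot{\beta}_n)]^{-1} \nabla s(\ddot{\beta}_n)'(X'Z\hat{P}_n Z'X/n^2)^{-1} (X'Z/n) \hat{P}_n \ddot{V}_n \hat{P}_n (Z'X/n) (X'Z\hat{P}_n Z'X/n^2)^{-1} \nabla s(\ddot{\beta}_n) [\nabla s(\ddot{\beta}_n)'(X'Z\hat{P}_n Z'X/n^2)^{-1} \nabla s(\ddot{\beta}_n)]^{-1}.$$

Now $\ddot{V}_n - \hat{V}_n \xrightarrow{p} 0$ and $\nabla s(\ddot{\beta}_n) - \nabla s(\tilde{\beta}_n) \xrightarrow{p} 0$ given that $\nabla s(\beta)$ is 
continuous. It follows from Proposition 2.30 that $\mathcal{LM}_n - \mathcal{W}_n \xrightarrow{p} 0$, so 
that $\mathcal{LM}_n \overset{A}{\sim} \chi_q^2$ by Lemma 4.7. ∎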
Exercise 4.46 
Proof. Under the conditions of Exercise 4.26, Proposition 4.45 tells us that 
the asymptotically efficient estimator is 

$$\beta_n^* = (X'X \hat{V}_n^{-1} X'X)^{-1} X'X \hat{V}_n^{-1} X'Y$$
$$= (X'X)^{-1} \hat{V}_n (X'X)^{-1} X'X \hat{V}_n^{-1} X'Y$$
$$= (X'X)^{-1} X'Y = \hat{\beta}_n,$$


where we substituted X’X =V,. m 


Exercise 4.47 

Proof. We assume that the conditions of Exercise 4.26 are satisfied. In 
addition, we assume $\hat{\sigma}_n^2 (Z'Z/n) - \sigma_o^2 L_n \xrightarrow{p} 0$ so that $\hat{V}_n = \hat{\sigma}_n^2 (Z'Z/n)$. 
(This follows from a law of large numbers.) Then from Proposition 4.45 it 
follows that the asymptotically efficient estimator is 

$$\beta_n^* = (X'Z(\hat{\sigma}_n^2 (Z'Z/n))^{-1} Z'X)^{-1} X'Z(\hat{\sigma}_n^2 (Z'Z/n))^{-1} Z'Y$$
$$= (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'Y$$
$$= \tilde{\beta}_{2SLS}.$$ ∎

Exercise 4.48 
Proof. From the definition 


VAB — Bz) = (X'ZP,»Z'X/n?)-1Vs(B,) 
x [Vs(82)! (X’ZÊn ZX /n®)-V8(B3) "Vs 8%) 
it follows that \/n(G** — 6%) > 0 provided that (X’ZP,,Z'X/n?)7! = 
Op(1), Vs(B;,) = Op(1), and /ns(B>) = 0,(1). The first two are easy 


to show, so we just show that //ns(G@") = op(1). Taking a mean value 
expansion and substituting the definition of B% gives 


Jns(82) = vns(B,) + VnV8,(B,, — Bn) 
= vns(B, ) — Vs! (X’'ZP,,Z'X)~!Vs(G,,) 
x[Vs(B,,) (X’ZP,Z'X)! Vs(6,,)]-* Vns(B,,) 
= (I 5 ae ZP,,Z'X)~Vs(G,,) 


x(Vs(B,,)(X'ZP nZ/X)-!Vs(B,,)|"") vns(B,). 




Now 
I — Vs! (X’ZP,,Z’X)~!Vs(G,,)[Vs(G,,)'(X’ZP,Z'X)-!Vs(G,,)|7} = op(1) 


since V8, — Vs(Bn) = op(1), and Jns(B.,) = /ns(B,) + Vs Vn(B,, — 
Bo) = 0 + Op(1)Op(1) = Op(1) by Lemma 4.6, and the result follows. m 


Exercise 4.56 


Proof. Since X? is F,—ı-measurable then so is Z?(y*) = X?NQ7+. For 


brevity, we write Z? = Z?(7y*) in the following. 
We have 


E°(n7' ` E°(Ze1|Fr-1)) 


t=1 
E°(n X ZE? (e| Fi-1)) 
t=1 


= 0, 


| 


E° (n™! X Zes) 
t=1 


and by the martingale difference property 


Va = var? (n~! X Zfer) 
t=1 


See 5 Zrere,Ly ) 


t=1 


= E°(n7} ` Z? E° (ere) |Fr_-1) Ze ) 


t=1 


= En ) 127027) 
t=1 


= En! >) XOX’). 


t=1 


Set Vp =n Yf XQ XY. 
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With Z? = X?;" we have 


n n -l n n 
p= (Dreva axe) Soxeaevs' Save 
t=1 t=1 t=1 t=1 


A -1 

> xemx) S Xa Y? 
t=1 t=1 

(X/Q71X) K'AY. 


I 


Exercise 4.58 
Proof. We have 


n -l n 
Y xearixe') D Xe Y? 


t=1 


n ~l n 
D xin "x7 SO X27" (XB? + er) 
1 


t=1 
n -l n 
= p+ xearixg) SOKO e. 
t=1 t=1 
Set 
n 
mp (y) =n X K? Qer 
t=1 
and we have 


z —1 
Vn(B% — B°) = (= Y xrarixe') n!/?m9 (y*) 


t=1 
= QUe") 


n —1 
+ (= Y xrara) - wo nme (7") 
t=1 


= Q2(7*) in mR (7*) + ope (1) 
using Theorem 4.54 (iii.a) and (ii). So 


avar? (8x) = QRY Va )Qa(7") 
where 


Vo(7*) = E° (m ` X20 eeir xr) 
t=1 


= nY E(X eeN X?’) 
t=1 

= nY E°(XP Q'E? (ere, Fi-1)Q7 XP’) 
t=1 

= nl Y E(X? Xe’) 


t=1 


using the law of iterated expectations. 
By assumption, Q2 (7*) = n~! $; 1 E°(K?N7'X?), so 


=i 
avar? (8%) = |n! NO E° (XFX?) 


t=1 


Consistency of Ô,„ follows given Theorem 4.54 (iii.a) and the assumption 
that Q2 (7*) = n! Y; E? (XIN +X?) by the continuity of the inverse 
function and given that Q°(7y*) = V2(7*) is uniformly positive definite. m 


Exercise 4.60 
Proof. Similar to Exercise 4.58 we have that 


n -1 n 
Vae- 8°) = (mxn) nV? 5" X A e: 


t=ł t=1 


n = n 
= (= ` popari) n VAS KO e 


t=1 t=1 
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Es -1 7 —1 
— oT tyo — ogogo- o 
H|» S X x;' = (r ‘SUE (X72, xn) | 


t=1 t=1 


n 
xn 1/2 > KON le 


t=1 


t=1 


A —1 
+ (= Sraa) 


x [n71 VOX Âu E "o DD Xe? 12] 


t=1 t=1 

-1 e —1 
H(i xana) - (iE raran) 

t=1 t=1 


x[n E X An e -n SO Xe] 
t=1 t=1 


n -1 ii 
= (= Sranan] n! Y KQ e: 
t=1 t=1 


+ Ope (1) Ope (1) + O(1)ope (1) + ope (1) ope (1) 


using Theorem 4.54 conditions (iii.a) and (iz) on the second, third, and 
fourth terms. 
It follows that 


s = 
avar” (8%) = (= Serine 1x2) 


t=1 
which is consistently estimated by 
n i -1 
Ba = (miyor) 
t=1 


given Theorem 4.54 (iii.a). m 
Exercise 4.61 
Proof. (i) Let Wẹ? = (1, W£)’; then 


Q 
3 
I 


n -1 n 
A n7} Y wrw? nS Weer 
t=1 


t=1 


n -1 n 
= (= S wows’ na SO Welee = X? (Bn ~ B°)? 
t=1 


t=1 


n -1 n 
(n= Swe n`! S Wee? 
t=1 


t=1 


n -1 n 
-2 (= > we ) nS WeeX? (Ên — 2°) 
t=1 


t=1 


n -1 n 
$ (= > we n=! S WAX? Bn — B°)? 
t=1 


t=1 


| 


n -1 n 

(E wewe) a SWW a opel) +50) 
= Qo top (1), 

provided that 


n =i 
(= Swi ) = Opo(1), 


t=1 


nS WX? = Op(1), 
t=1 


n YO W2(X? OX”) = Ope(1), 
t=1 


and (Bn Z B°) = Ope (1). 


The third condition enables us to verify 


n! YO WX? (Èn ~ 8°) =n! X WEX? Bn — B°)(Bn — B°) X? 
t=1 t=1 
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n} 2 Wee! & X2")vec((Bn — B°)(Bn — B°)’) 


= Ope(1)ope(1) = ope(1). 


The first condition holds quite generally since Wẹ? is bounded. Set Vd c= 
n`! i We then 


“1 aion 1 we 
n :5 WW? = ( Wo ply" po J: 
i n n pe ~ 


t=1 


If this matrix is uniformly nonsingular, its inverse is Opo(1), and for that 
it suffices that W? is bounded away from zero and one. 


(ii) Consider 


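$$
\begin{aligned}
& n^{-1/2}\sum_{t=1}^n X_t^o \hat{\Omega}_{nt}^{-1}\varepsilon_t - n^{-1/2}\sum_{t=1}^n X_t^o \Omega_t^{o-1}\varepsilon_t \\
&\quad = n^{-1/2}\sum_{t=1}^n X_t^o \big(\hat{\Omega}_{nt}^{-1} - \Omega_t^{o-1}\big)\varepsilon_t \\
&\quad = n^{-1/2}\sum_{t=1}^n X_t^o \big((\tilde{W}_t^{o\prime}\hat{\alpha}_n)^{-1} - (\tilde{W}_t^{o\prime}\alpha^o)^{-1}\big)\varepsilon_t \\
&\quad = n^{-1/2}\sum_{t=1}^n X_t^o \big((\hat{\alpha}_{1n})^{-1} - (\alpha_1^o)^{-1}\big)\varepsilon_t 1[W_t^o = 0] \\
&\qquad + n^{-1/2}\sum_{t=1}^n X_t^o \big((\hat{\alpha}_{1n} + \hat{\alpha}_{2n})^{-1} - (\alpha_1^o + \alpha_2^o)^{-1}\big)\varepsilon_t 1[W_t^o = 1] \\
&\quad = \big((\hat{\alpha}_{1n})^{-1} - (\alpha_1^o)^{-1}\big)\, n^{-1/2}\sum_{t=1}^n X_t^o \varepsilon_t 1[W_t^o = 0] \\
&\qquad + \big((\hat{\alpha}_{1n} + \hat{\alpha}_{2n})^{-1} - (\alpha_1^o + \alpha_2^o)^{-1}\big)\, n^{-1/2}\sum_{t=1}^n X_t^o \varepsilon_t 1[W_t^o = 1] \\
&\quad = o_{p^o}(1)O_{p^o}(1) + o_{p^o}(1)O_{p^o}(1).
\end{aligned}
$$

To obtain the final equality, we apply the martingale difference central limit
theorem to the terms $n^{-1/2}\sum_{t=1}^n X_t^o \varepsilon_t 1[W_t^o = j]$, $j = 0, 1$, and require that
$\alpha_1^o$ and $\alpha_1^o + \alpha_2^o$ are different from zero.

Similarly

$$
\begin{aligned}
& n^{-1}\sum_{t=1}^n X_t^o \hat{\Omega}_{nt}^{-1} X_t^{o\prime} - n^{-1}\sum_{t=1}^n X_t^o \Omega_t^{o-1} X_t^{o\prime} \\
&\quad = \big((\hat{\alpha}_{1n})^{-1} - (\alpha_1^o)^{-1}\big)\, n^{-1}\sum_{t=1}^n X_t^o X_t^{o\prime} 1[W_t^o = 0] \\
&\qquad + \big((\hat{\alpha}_{1n} + \hat{\alpha}_{2n})^{-1} - (\alpha_1^o + \alpha_2^o)^{-1}\big)\, n^{-1}\sum_{t=1}^n X_t^o X_t^{o\prime} 1[W_t^o = 1] \\
&\quad = o_{p^o}(1)O_{p^o}(1) + o_{p^o}(1)O_{p^o}(1).
\end{aligned}
$$
■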


Exercise 4.63 
Proof. Similar to Exercise 4.60 we have 


t=1 


—1 = 
val, ~ B°) = (wa 87) nS KOH, 


—1 = 
= (= > esenr-*x:)) pt? >> X22 "er 


t=1 t=1 
n -1 n -1 
+ (e 82057) - (= Sakean) ] 
t=1 t=1 
xn? SO XQ te 
t=1 
i -1 
+ ae timed) 
t=1 
x [n7 yk oO, een e D t] 
t=1 
—1 —1 
H(p Eraur) - r Aaaa) | 
t=1 1 


x fp PS RO a-n SKIN i 


t=1 t=1 
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n —1 n 

= (o Erana) AS KO e 
t=1 t=1 

+ ope (1)Ope (1) + O(1)ope(1) + opo (1)ope (1) 


applying Theorem 4.54 conditions (iii.a) and (ii) to the second, third, and 
fourth terms. 
Since 


E (XAK?) = E° (E (KR KPIA) 
= E(X?0?-'X?') 


by the law of iterated expectations, we have 


z i 
avar°?(@*) = (m E) ; 


t=1 
To show that the matrix is consistently estimated by 
n A n i =l 
D, =2 (= VR Âa XP An AO K Ân x ; 
t=1 t=1 
it suffices that Dz! — avar?(Q*)~! = ope (1). We have 
D>} — e 


= (1/2)( nT RM We A taD xiu Rp) 
t=1 


-n7t > Ee (KOMP PK 


t=1 
= (1/2)( nS > X07} X? — nS E(X? Ke) 
t=1 t=] 
+(1/2)(n7? XP Ke -n7 DO B(KeE-KY), 


t=1 t=1 


again using the law of iterated expectations. 
Now 


n> KPA K? -n Y E (RRX?) 2 0 
t=1 t=1 
by Theorem 4.54 condition (iii) and Theorem 4.62 condition (iii) and the
result follows. ■


Exercise 4.64 


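Proof. (i)(a) Set $v_t = X_t^o - E(X_t^o \mid \mathcal{F}_{t-1})$ so that $X_t^{o\prime} = Z_t^{o\prime}\Pi^o + v_t'$. Then

$$
\begin{aligned}
\hat{\Pi}_n
&= \Big(n^{-1}\sum_{t=1}^n Z_t^o Z_t^{o\prime}\Big)^{-1} n^{-1}\sum_{t=1}^n Z_t^o X_t^{o\prime} \\
&= \Pi^o + \Big(n^{-1}\sum_{t=1}^n Z_t^o Z_t^{o\prime}\Big)^{-1} n^{-1}\sum_{t=1}^n Z_t^o v_t' \\
&= \Pi^o + o_{p^o}(1)
\end{aligned}
$$

given $\big(n^{-1}\sum_{t=1}^n Z_t^o Z_t^{o\prime}\big)^{-1} = O_{p^o}(1)$ and $n^{-1}\sum_{t=1}^n Z_t^o v_t' = o_{p^o}(1)$; the latter
follows by the martingale difference weak law of large numbers.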


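(b) Substituting

$$
\hat{\varepsilon}_t = \varepsilon_t + X_t^{o\prime}(\beta^o - \tilde{\beta}_n)
$$

we have that

$$
\begin{aligned}
\hat{\Sigma}_n = n^{-1}\sum_{t=1}^n \hat{\varepsilon}_t \hat{\varepsilon}_t'
&= n^{-1}\sum_{t=1}^n \varepsilon_t \varepsilon_t' \\
&\quad + n^{-1}\sum_{t=1}^n \varepsilon_t (\beta^o - \tilde{\beta}_n)' X_t^o
 + n^{-1}\sum_{t=1}^n X_t^{o\prime}(\beta^o - \tilde{\beta}_n)\varepsilon_t' \\
&\quad + n^{-1}\sum_{t=1}^n X_t^{o\prime}(\beta^o - \tilde{\beta}_n)(\beta^o - \tilde{\beta}_n)' X_t^o.
\end{aligned}
$$

Given that $n^{-1}\sum_{t=1}^n \varepsilon_t X_t^{o\prime}$ and $n^{-1}\sum_{t=1}^n X_t^o X_t^{o\prime}$ are $O_{p^o}(1)$, that

$$
\Big(n^{-1}\sum_{t=1}^n Z_t^o Z_t^{o\prime}\Big)^{-1} = O_{p^o}(1), \qquad
n^{-1}\sum_{t=1}^n Z_t^o \varepsilon_t = o_{p^o}(1),
$$

so that $(\beta^o - \tilde{\beta}_n) = o_{p^o}(1)$, it follows that the last three terms vanish in
probability. To see this, apply the rule $\mathrm{vec}(ABC) = (C' \otimes A)\mathrm{vec}(B)$ to each
of the terms, for example:

$$
\mathrm{vec}\Big(n^{-1}\sum_{t=1}^n \varepsilon_t (\beta^o - \tilde{\beta}_n)' X_t^o\Big)
= n^{-1}\sum_{t=1}^n (X_t^{o\prime} \otimes \varepsilon_t)\,\mathrm{vec}\big((\beta^o - \tilde{\beta}_n)'\big)
= O_{p^o}(1)\,o_{p^o}(1) = o_{p^o}(1).
$$

Given $n^{-1}\sum_{t=1}^n \varepsilon_t \varepsilon_t' \xrightarrow{p^o} \Sigma^o$ it then follows that $\hat{\Sigma}_n = \Sigma^o + o_{p^o}(1)$.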
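
The rule vec(ABC) = (C' ⊗ A)vec(B) used above can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`matmul`, `transpose`, `vec`, `kron`) and the small integer matrices are illustrative assumptions, not part of the text.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def vec(a):
    """Stack the columns of a matrix into one long list."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]

def kron(a, b):
    """Kronecker product of two matrices."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

# Hypothetical conformable matrices: A is 2x3, B is 3x2, C is 2x2.
A = [[1, 2, 0], [0, 1, 3]]
B = [[2, 1], [0, 1], [4, 5]]
C = [[1, 2], [3, 4]]

lhs = vec(matmul(matmul(A, B), C))                        # vec(ABC)
rhs_col = matmul(kron(transpose(C), A), [[x] for x in vec(B)])
rhs = [row[0] for row in rhs_col]                         # (C' kron A) vec(B)
identity_holds = lhs == rhs
```

Integer entries keep the comparison exact, so `identity_holds` does not depend on floating-point tolerance.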
(ii) To verify condition (iii) of Theorem 4.62 we write 


n n 
= a a-l = = 
n X # ZS, eron y wi AN E 


t=1 t=1 


= ain SS Ei- T D 1 


t=1 


— aa S Z _yyo- ye, 


+ê’ n25 zeg- ler- a, n any gae Et 


t=1 


= R! n! Se; ® Z?)vec( $7 — N°!) 


HRL T), n V2 2 ZD le 
= Oo (1)Oge(1)oge(1) i Ope (1)Op0(1) = opo (1). 


Similarly, to verify condition (vi) of Theorem 4.62 we write 


t=1 


n n 
a—1 
—1 a! ryo o —1 I poyo- l yo 
n X KZ, Xp -n X nora X? 
t=1 


= f'n} 5 (X: ® Z?)vec($7 — E27?) 
t=1 
+Ê — x Jat ys Zo x, 
t=1 


= Op0(1)Ope(1)0p0(1) + opo (1)Opo(1) = opo (1). 
Exercise 5.4
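Proof. We verify the conditions of Exercise 4.26. To apply Theorem 5.2
let $\mathcal{Z}_t = \lambda' V^{-1/2} Z_t \varepsilon_t$, where $\lambda'\lambda = 1$, and consider

$$
n^{-1/2}\sum_{t=1}^n \lambda' V^{-1/2} Z_t \varepsilon_t = n^{-1/2}\sum_{t=1}^n \mathcal{Z}_t.
$$

The summands $\mathcal{Z}_t$ are i.i.d. by Proposition 3.2 given (ii) with $E(\mathcal{Z}_t) = 0$
given (iii.a) and $\mathrm{var}(\mathcal{Z}_t) = \lambda' V^{-1/2} V V^{-1/2} \lambda = 1$ given (iii.b) and (iii.c).
Therefore

$$
n^{-1/2}\sum_{t=1}^n \mathcal{Z}_t = n^{-1/2}\sum_{t=1}^n \lambda' V^{-1/2} Z_t \varepsilon_t = \lambda' V^{-1/2} n^{-1/2} Z'\varepsilon \xrightarrow{d} N(0, 1)
$$

by Theorem 5.2. It follows from Proposition 5.1 that $V^{-1/2} n^{-1/2} Z'\varepsilon \xrightarrow{d} N(0, I)$
since if $\mathcal{Y} \sim N(0, I)$ then $\lambda'\mathcal{Y} \sim N(0, 1)$ for all $\lambda$, $\lambda'\lambda = 1$. By
(iii.b) and (iii.c) we have that $V$ is $O(1)$ and nonsingular. It follows from
Theorem 3.1 and Theorem 2.24 that $Z'X/n - Q \xrightarrow{p} 0$ given (ii), (iv.a),
and (iv.b). Since the remaining conditions of Exercise 4.26 are satisfied by
assumption, the result follows. ■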
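
The normalization var(λ'V^{-1/2}Z_tε_t) = λ'V^{-1/2}VV^{-1/2}λ = 1 can be illustrated for a 2 x 2 case. This is only a sketch: the matrix `V` below is a hypothetical positive definite example, and the closed form for the square root of a 2 x 2 symmetric positive definite matrix is a convenience chosen here, not something the text relies on.

```python
import math

def inv_sqrt_2x2(v):
    """Inverse symmetric square root of a 2x2 symmetric positive definite matrix."""
    a, b, d = v[0][0], v[0][1], v[1][1]
    s = math.sqrt(a * d - b * b)        # sqrt(det V) = det(V^{1/2})
    t = math.sqrt(a + d + 2.0 * s)      # trace of V^{1/2}
    r = [[(a + s) / t, b / t], [b / t, (d + s) / t]]   # V^{1/2} = (V + s I) / t
    det_r = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    return [[r[1][1] / det_r, -r[0][1] / det_r],
            [-r[1][0] / det_r, r[0][0] / det_r]]

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def quad_form(lam, m):
    """lam' m lam for a 2-vector lam and a 2x2 matrix m."""
    return sum(lam[i] * m[i][j] * lam[j] for i in range(2) for j in range(2))

V = [[2.0, 0.6], [0.6, 1.0]]            # hypothetical covariance matrix
S = inv_sqrt_2x2(V)                     # V^{-1/2}
M = matmul2(matmul2(S, V), S)           # V^{-1/2} V V^{-1/2}, should be the identity

# Every unit vector lam gives a scaled summand with variance one.
variances = [quad_form((math.cos(th), math.sin(th)), M) for th in (0.0, 0.7, 1.9, 3.0)]
```

Each entry of `variances` should equal one up to rounding, whatever direction λ is chosen, which is what allows Theorem 5.2 to be applied to every linear combination.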


Exercise 5.5 
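Proof. Given the i.i.d. assumption of Exercise 5.4 it follows that

$$
V = n^{-1}\sum_{t=1}^n E(\varepsilon_t^2 Z_t Z_t') = E(\varepsilon_t^2 Z_t Z_t').
$$

Now

$$
\begin{aligned}
E(\varepsilon_t^2 Z_t Z_t') &= E\big(E(\varepsilon_t^2 Z_t Z_t' \mid Z_t)\big) \\
&= E\big(E(\varepsilon_t^2 \mid Z_t) Z_t Z_t'\big) \\
&= \sigma_o^2 E(Z_t Z_t') = \sigma_o^2 L.
\end{aligned}
$$

Hence $V = \sigma_o^2 L$. It follows from Exercise 4.47 that the efficient IV estimator
chooses $P = V^{-1}$ to yield the two stage least squares estimator,

$$
\tilde{\beta}_{2SLS} = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'Y.
$$

The natural estimator for $V$ is $\hat{V}_n = \hat{\sigma}_n^2 (Z'Z/n)$, where

$$
\hat{\sigma}_n^2 = (Y - X\tilde{\beta}_{2SLS})'(Y - X\tilde{\beta}_{2SLS})/n.
$$

The conditions of Exercise 5.4 are not quite strong enough to ensure $\hat{V}_n$ is
consistent for $V$. It suffices that we add

(i') $E|\varepsilon_t^2| < \infty$;
(ii') (a) $E|Z_{ti}^2| < \infty$, $i = 1, \ldots, k$;
(b) $E|X_{tj}^2| < \infty$, $j = 1, \ldots, k$;
(c) $L = E(Z_t Z_t')$ is nonsingular.

Note that (ii'.a) and (ii'.b) together imply (iv.a) by the Cauchy-Schwarz
inequality.

We show that $Z'Z/n \xrightarrow{p} L$ and $\hat{\sigma}_n^2 \xrightarrow{p} \sigma_o^2$. Since $\{Z_t Z_t'\}$ is an i.i.d.
sequence given (ii), it follows that $Z'Z/n = n^{-1}\sum_{t=1}^n Z_t Z_t' \xrightarrow{p} L$ by
Theorem 3.1 and Theorem 2.24 given (ii'.a). Next consider

$$
\begin{aligned}
\hat{\sigma}_n^2 &= (\varepsilon - X(\tilde{\beta}_{2SLS} - \beta_o))'(\varepsilon - X(\tilde{\beta}_{2SLS} - \beta_o))/n \\
&= \varepsilon'\varepsilon/n - 2(\tilde{\beta}_{2SLS} - \beta_o)'X'\varepsilon/n + (\tilde{\beta}_{2SLS} - \beta_o)'(X'X/n)(\tilde{\beta}_{2SLS} - \beta_o).
\end{aligned}
$$

Now $\tilde{\beta}_{2SLS} - \beta_o \xrightarrow{p} 0$. The elements of $X_t\varepsilon_t$ have finite expected absolute
value given (i') and (ii'.b). Hence $X'\varepsilon/n = O_{a.s.}(1)$ by Theorem 3.1. Similarly,
$X_t X_t'$ has finite expected absolute value given (ii'.b). Since $\{X_t X_t'\}$
is an i.i.d. sequence it follows from Theorem 3.1 that $X'X/n = O_{a.s.}(1)$.
Therefore

$$
2(\tilde{\beta}_{2SLS} - \beta_o)'X'\varepsilon/n \xrightarrow{p} 0
$$

and

$$
(\tilde{\beta}_{2SLS} - \beta_o)'(X'X/n)(\tilde{\beta}_{2SLS} - \beta_o) \xrightarrow{p} 0
$$

by Theorem 2.24 and Proposition 2.30. Finally, consider

$$
\varepsilon'\varepsilon/n = n^{-1}\sum_{t=1}^n \varepsilon_t^2.
$$

Now $\{\varepsilon_t^2\}$ is an i.i.d. sequence given (ii) with finite expected absolute
value given (i'). It follows from Theorem 3.1 that $n^{-1}\sum_{t=1}^n \varepsilon_t^2 \xrightarrow{a.s.} E(\varepsilon_t^2) = \sigma_o^2$,
so that $\hat{\sigma}_n^2 \xrightarrow{p} \sigma_o^2$.

Hence $\hat{V}_n - V = \hat{\sigma}_n^2(Z'Z/n) - \sigma_o^2 L \xrightarrow{p} 0$ by Proposition 2.30. Given
that $\sigma_o^2 > 0$ and $L$ is nonsingular it follows from Proposition 2.30 that

$$
\hat{P}_n - P = \hat{V}_n^{-1} - V^{-1} = (\hat{\sigma}_n^2(Z'Z/n))^{-1} - (\sigma_o^2 L)^{-1} \xrightarrow{p} 0.
$$

This completes the exercise. ■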
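
As an illustration of the two stage least squares formula and of the estimators of σ² and V discussed in this exercise, here is a minimal sketch for one regressor and two instruments in plain Python. The function name `two_sls` and the tiny data set are hypothetical; the disturbances are constructed to be exactly orthogonal to the instruments in sample so that the arithmetic can be followed by hand.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def two_sls(x, y, z1, z2):
    """2SLS with one regressor x and instruments (z1, z2).

    Returns (beta, sigma2, v_hat), where v_hat = sigma2 * (Z'Z / n).
    """
    n = len(y)
    g11, g12, g22 = dot(z1, z1), dot(z1, z2), dot(z2, z2)   # Z'Z
    a1, a2 = dot(z1, x), dot(z2, x)                         # Z'x
    b1, b2 = dot(z1, y), dot(z2, y)                         # Z'y
    det = g11 * g22 - g12 * g12
    w1 = (g22 * a1 - g12 * a2) / det                        # w = (Z'Z)^{-1} Z'x
    w2 = (g11 * a2 - g12 * a1) / det
    beta = (w1 * b1 + w2 * b2) / (w1 * a1 + w2 * a2)        # (x'P_Z x)^{-1} x'P_Z y
    resid = [yi - xi * beta for xi, yi in zip(x, y)]
    sigma2 = dot(resid, resid) / n
    v_hat = [[sigma2 * g11 / n, sigma2 * g12 / n],
             [sigma2 * g12 / n, sigma2 * g22 / n]]
    return beta, sigma2, v_hat

# Hypothetical data: eps is orthogonal to both instruments, but x depends on eps.
z1 = [1.0] * 6
z2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
eps = [1.0, -2.0, 1.0, 0.0, 0.0, 0.0]
x = [z + e for z, e in zip(z2, eps)]          # endogenous regressor
y = [2.0 * xi + e for xi, e in zip(x, eps)]   # true coefficient is 2

beta_2sls, sigma2_hat, v_hat = two_sls(x, y, z1, z2)
beta_ols = dot(x, y) / dot(x, x)              # least squares without intercept, for contrast
```

Because the regressor is correlated with the disturbance, `beta_ols` is pushed away from the true value, while `beta_2sls` recovers it in this constructed sample and `sigma2_hat` equals the average squared disturbance.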
Exercise 5.9
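Proof. For an identically distributed sequence $\{Z_t\}$ with $E(Z_t) = \mu$,
$\mathrm{var}(Z_t) = \sigma^2 < \infty$, the Lindeberg condition reduces to

$$
\lim_{n \to \infty} \sigma^{-2} \int_{(z-\mu)^2 > \epsilon n \sigma^2} (z - \mu)^2\, dF(z) = 0.
$$

Now let

$$
g_n(z) = (z - \mu)^2\, 1[(z-\mu)^2 \le \epsilon n \sigma^2].
$$

Then $\{g_n\}$ is an increasing sequence of functions that converges to the function
$g(z) = (z - \mu)^2$. The monotone convergence theorem (see Rao, 1973,
p. 135) allows us to interchange limit and integral:

$$
\begin{aligned}
\lim_{n \to \infty} \int_{(z-\mu)^2 \le \epsilon n \sigma^2} (z - \mu)^2\, dF(z)
&= \lim_{n \to \infty} \int g_n(z)\, dF(z) \\
&= \int \lim_{n \to \infty} g_n(z)\, dF(z) = \int (z - \mu)^2\, dF(z) = \sigma^2.
\end{aligned}
$$

Now

$$
\begin{aligned}
\lim_{n \to \infty} \sigma^{-2} \int_{(z-\mu)^2 > \epsilon n \sigma^2} (z - \mu)^2\, dF(z)
&= \lim_{n \to \infty} \sigma^{-2}\Big(\sigma^2 - \int_{(z-\mu)^2 \le \epsilon n \sigma^2} (z - \mu)^2\, dF(z)\Big) \\
&= \sigma^{-2}\Big(\sigma^2 - \lim_{n \to \infty} \int g_n(z)\, dF(z)\Big) = 0.
\end{aligned}
$$
■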


Exercise 5.12 
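Proof. We verify the conditions of Theorem 4.25. To apply Theorem 5.11,
let $\mathcal{Z}_{nt} = \lambda' V_n^{-1/2} X_t \varepsilon_t$ and consider

$$
n^{-1/2}\sum_{t=1}^n \lambda' V_n^{-1/2} X_t \varepsilon_t = n^{-1/2}\sum_{t=1}^n \mathcal{Z}_{nt}.
$$

The summands $\mathcal{Z}_{nt}$ are independent by Proposition 3.3 given (ii), with
$E(\mathcal{Z}_{nt}) = 0$ given (iii.a), and

$$
\begin{aligned}
\bar{\sigma}_n^2 = \mathrm{var}(\sqrt{n}\,\bar{\mathcal{Z}}_n)
&= \lambda' V_n^{-1/2}\, \mathrm{var}(n^{-1/2} X'\varepsilon)\, V_n^{-1/2} \lambda \\
&= \lambda' V_n^{-1/2} V_n V_n^{-1/2} \lambda = 1
\end{aligned}
$$

given (iii.c). By (iii.b) $E|\mathcal{Z}_{nt}|^{2+\delta}$ is uniformly bounded (apply Minkowski's
inequality). Hence, for all $\lambda$, $\lambda'\lambda = 1$,

$$
n^{-1/2}\sum_{t=1}^n \mathcal{Z}_{nt} = n^{-1/2}\sum_{t=1}^n \lambda' V_n^{-1/2} X_t \varepsilon_t = \lambda' V_n^{-1/2} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, 1)
$$

and therefore $V_n^{-1/2} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, I)$ by Proposition 5.1.

Assumptions (ii), (iv.a), and (iv.b) ensure that $X'X/n - M_n \xrightarrow{p} 0$ by
Corollary 3.9 and Theorem 2.24. Given (iv.a), $M_n = O(1)$, and $M_n$ is uniformly
positive definite given (iv.b). Since (v) also holds, the result follows from
Theorem 4.25. ■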


Exercise 5.18 
Proof. The following conditions are sufficient for asymptotic normality: 


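(i') (a) $Y_t = \beta_{o1} Y_{t-1} + \beta_{o2} Y_{t-2} + \varepsilon_t$;
(b) $-1 < \beta_{o2} < 1$;
$\beta_{o2} - \beta_{o1} < 1$;
$\beta_{o1} + \beta_{o2} < 1$;

(ii') (a) $\{\varepsilon_t\}$ is a stationary, ergodic sequence;
(b) $\{\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence, where
$\mathcal{F}_t = \sigma(\ldots, \varepsilon_{t-1}, \varepsilon_t)$;

(iii') (a) $E(\varepsilon_t^2 \mid \mathcal{F}_{t-1}) = \sigma_o^2 > 0$.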
(b) Ele,|* = Ko, 
We verify the conditions of Theorem 5.17.

(i) is implied by (i').

(ii) $\{(X_t', \varepsilon_t)\} = \{(Y_{t-1}, Y_{t-2}, \varepsilon_t)\}$ is a stationary ergodic sequence by
Theorem 3.35.

(iii.a) The data generating process given in (i'.a) is an AR(2) time series process
and condition (i'.b) is the familiar stationarity condition that the
roots of the polynomial $1 - \beta_{o1} z - \beta_{o2} z^2$ lie outside the unit disk. Given
(i'.b) we can write $Y_t$ as an infinite moving average, $Y_t = \sum_{j=0}^{\infty} c_j \varepsilon_{t-j}$,
where $c_j = (|\beta_{o2}|^{1/2})^j a(j)$ and $|a(j)| < \Delta' < \infty$ for all $j \ge 0$
(see Dhrymes, 1980, pp. 394-395). Since $|\beta_{o2}| < 1$ it follows that
$\sum_{j=0}^{\infty} |c_j| < \infty$. Thus, $Y_t$ is $\mathcal{F}_t$-measurable and

$$
\begin{aligned}
E(Y_{t-1}\varepsilon_t \mid \mathcal{F}_{t-1}) &= Y_{t-1} E(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0, \\
E(Y_{t-2}\varepsilon_t \mid \mathcal{F}_{t-1}) &= Y_{t-2} E(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0.
\end{aligned}
$$

Given (ii'.a), $\{X_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence and hence
a mixingale. Therefore Theorem 5.17 (iii.a) holds.

(iii.b) We have that $E|Y_{t-1}\varepsilon_t|^2 \le (E|Y_{t-1}|^4)^{1/2}(E|\varepsilon_t|^4)^{1/2}$ by the Cauchy-Schwarz
inequality. Hence, if we can show $E|Y_{t-1}|^4 < \infty$, then condition
(iii.b) is verified. Now by Minkowski's inequality (see Exercise 3.53) and stationarity,

$$
E|Y_t|^4 = E\Big|\sum_{j=0}^{\infty} c_j \varepsilon_{t-j}\Big|^4
\le \Big(\sum_{j=0}^{\infty} |c_j| \big(E|\varepsilon_{t-j}|^4\big)^{1/4}\Big)^4
= E|\varepsilon_t|^4 \Big(\sum_{j=0}^{\infty} |c_j|\Big)^4 < \infty.
$$

(iii.c) By the martingale difference assumption and by stationarity we have
that

$$
V_n = n^{-1}\sum_{t=1}^n E(X_t \varepsilon_t \varepsilon_t X_t') = \sigma_o^2 E(X_t X_t')
= \sigma_o^2 \begin{pmatrix} E(Y_{t-1}^2) & E(Y_{t-1}Y_{t-2}) \\ E(Y_{t-2}Y_{t-1}) & E(Y_{t-2}^2) \end{pmatrix} = \sigma_o^2 M,
$$

which is positive definite if $E(Y_{t-1}^2)^2 - E(Y_{t-2}Y_{t-1})^2 > 0$ or equivalently,
if $E(Y_t^2) > |E(Y_t Y_{t-1})|$. Now

$$
\begin{aligned}
E(Y_t^2) &= \sigma_o^2 (1 - \beta_{o2}) \big[(1 + \beta_{o2})\big((1 - \beta_{o2})^2 - \beta_{o1}^2\big)\big]^{-1}, \\
E(Y_t Y_{t-1}) &= \sigma_o^2 \beta_{o1} \big[(1 + \beta_{o2})\big((1 - \beta_{o2})^2 - \beta_{o1}^2\big)\big]^{-1}
\end{aligned}
$$

(see Granger and Newbold, 1977, Ch. 1) so $E(Y_t^2) > |E(Y_t Y_{t-1})|$
holds if and only if $(1 - \beta_{o2}) > |\beta_{o1}|$. This is ensured by (i'.b), and the
condition holds.

(iv.a) $E|Y_{t-1}^2| < \infty$ is directly implied by $E|Y_{t-1}^4| < \infty$ using Jensen's
inequality.

(iv.b) That $M$ is positive definite was proven in (iii.c).
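
The link between condition (i'.b) and the roots of 1 - β_{o1}z - β_{o2}z², and the second-moment formulas quoted in (iii.c), can both be checked numerically. The sketch below uses only the Python standard library; the function names and the parameter grid are illustrative assumptions. The grid is offset so that no point lies on a boundary of the stationarity region.

```python
import cmath

def in_triangle(b1, b2):
    """Condition (i'.b): -1 < b2 < 1, b2 - b1 < 1, b1 + b2 < 1."""
    return -1 < b2 < 1 and b2 - b1 < 1 and b1 + b2 < 1

def roots_outside_unit_disk(b1, b2):
    """True when every root of 1 - b1*z - b2*z**2 has modulus above one."""
    if b2 == 0:
        return b1 == 0 or abs(1 / b1) > 1
    disc = cmath.sqrt(b1 * b1 + 4 * b2)
    roots = [(-b1 + disc) / (2 * b2), (-b1 - disc) / (2 * b2)]
    return all(abs(z) > 1 for z in roots)

def autocovariances(b1, b2, sigma2=1.0):
    """gamma_0, gamma_1, gamma_2 of a stationary AR(2) with innovation variance sigma2."""
    g0 = sigma2 * (1 - b2) / ((1 + b2) * ((1 - b2) ** 2 - b1 ** 2))
    g1 = b1 * g0 / (1 - b2)
    g2 = b1 * g1 + b2 * g0
    return g0, g1, g2

# The two descriptions of the stationarity region agree on an offset grid.
grid = [(i / 10 + 0.03, j / 10 + 0.01) for i in range(-25, 25) for j in range(-15, 15)]
agree = all(in_triangle(b1, b2) == roots_outside_unit_disk(b1, b2) for b1, b2 in grid)

# The moment formulas satisfy gamma_0 = b1*gamma_1 + b2*gamma_2 + sigma2.
b1, b2 = 0.5, 0.3
g0, g1, g2 = autocovariances(b1, b2)
yule_walker_gap = g0 - (b1 * g1 + b2 * g2 + 1.0)
```

For the example values, `g0` exceeds `abs(g1)`, which is the inequality that makes M positive definite.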


Exercise 5.19 
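Proof. We verify the conditions of Exercise 4.26. The proof that

$$
V^{-1/2} n^{-1/2} Z'\varepsilon \overset{A}{\sim} N(0, I)
$$

is exactly parallel to the proof that

$$
V^{-1/2} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, I)
$$

in Theorem 5.17 with $Z$ replacing $X$ everywhere. Next, $Z'X/n - Q \xrightarrow{a.s.} 0$
by the ergodic theorem (Theorem 3.34) given (ii), (iv.a), and (iv.b), where
$Q$ is finite with full column rank. Since the conditions of Exercise 4.26 are
satisfied, it follows that

$$
D_n^{-1/2}\sqrt{n}(\tilde{\beta}_n - \beta_o) \overset{A}{\sim} N(0, I),
$$

where

$$
D_n = (Q'PQ)^{-1} Q'P V_n P Q (Q'PQ)^{-1}.
$$

Since $D_n - D \to 0$ it follows that

$$
\begin{aligned}
& D^{-1/2}\sqrt{n}(\tilde{\beta}_n - \beta_o) - D_n^{-1/2}\sqrt{n}(\tilde{\beta}_n - \beta_o) \\
&\quad = (D^{-1/2} D_n^{1/2} - I)\, D_n^{-1/2}\sqrt{n}(\tilde{\beta}_n - \beta_o) \xrightarrow{p} 0
\end{aligned}
$$

by Lemma 4.6. Therefore, by Lemma 4.7, $D^{-1/2}\sqrt{n}(\tilde{\beta}_n - \beta_o) \overset{A}{\sim} N(0, I)$.
■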


Exercise 5.21 
Proof. We verify the conditions of Theorem 4.25. First we apply Theorem
5.20 and Proposition 5.1 to show that $V_n^{-1/2} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, I)$. Consider

$$
\lambda' V_n^{-1/2} n^{-1/2} X'\varepsilon = n^{-1/2}\sum_{t=1}^n \lambda' V_n^{-1/2} X_t \varepsilon_t.
$$

By Theorem 3.49, $\{\lambda' V_n^{-1/2} X_t \varepsilon_t\}$ is a mixing sequence, with either $\phi$ of
size $-r/(2r - 1)$, $r > 2$, or $\alpha$ of size $-r/(r - 2)$, $r > 2$, given (ii). Further,
$E(\lambda' V_n^{-1/2} X_t \varepsilon_t) = 0$ given (iii.a), and application of Minkowski's
inequality gives

$$
E\big(|\lambda' V_n^{-1/2} X_t \varepsilon_t|^{r+\delta}\big) \le \Delta < \infty
$$

for some $\delta > 0$ and for all $t$ given (iii.b).
It follows from Theorem 5.20 that for all $\lambda$, $\lambda'\lambda = 1$, we have

$$
n^{-1/2}\sum_{t=1}^n \lambda' V_n^{-1/2} X_t \varepsilon_t = \lambda' V_n^{-1/2} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, 1).
$$

Hence, by Proposition 5.1, $V_n^{-1/2} n^{-1/2} X'\varepsilon \overset{A}{\sim} N(0, I)$, so Theorem 4.25 (ii)
holds.

Next, $X'X/n - M_n \xrightarrow{p} 0$ by Corollary 3.48 and Theorem 2.24 given
(iv.a). Given (iv.a), $M_n = O(1)$, and $\det(M_n) > \delta > 0$ for all $n$ sufficiently
large given (iv.b). Hence, the conditions of Theorem 4.25 are satisfied and
the result follows. ■


Exercise 5.22 
Proof. The following conditions are sufficient for asymptotic normality: 


(i') (a) $Y_t = \beta_{o1} Y_{t-1} + \beta_{o2} W_t + \varepsilon_t$;
(b) $|\beta_{o1}| < 1$, $|\beta_{o2}| < \infty$;
(ii') $\{Y_t\}$ is a mixing sequence with either $\phi$ of size $-r/2(r - 1)$, $r > 2$, or
$\alpha$ of size $-r/(r - 2)$, $r > 2$; $\{W_t\}$ is a bounded nonstochastic sequence.
(iii') (a) $E(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0$, $t = 1, 2, \ldots$, where $\mathcal{F}_t = \sigma(Y_t, Y_{t-1}, \ldots)$;
(b) $E|\varepsilon_t|^{2r} < \Delta < \infty$, $t = 1, 2, \ldots$;
(c) $V_n = \mathrm{var}(n^{-1/2}\sum_{t=1}^n X_t \varepsilon_t)$ is uniformly positive definite, where
$X_t = (Y_{t-1}, W_t)$;
(iv') (a) $M_n = E(X'X/n)$ has $\det(M_n) > \delta > 0$ for all $n$ sufficiently large.

We verify the conditions of Exercise 5.21. Since $\varepsilon_t = Y_t - \beta_{o1} Y_{t-1} - \beta_{o2} W_t$,
it follows from Theorem 3.49 that $\{(X_t, \varepsilon_t)\} = \{(Y_{t-1}, W_t, \varepsilon_t)\}$ is mixing
with either $\phi$ of size $-r/2(r - 1)$, $r > 2$, or $\alpha$ of size $-r/(r - 2)$, $r > 2$,
given (ii'). Thus, condition (ii) of Exercise 5.21 is satisfied.
Next consider condition (iii). Now $E(W_t \varepsilon_t) = W_t E(\varepsilon_t) = 0$ given (iii'.a).
Also, by repeated substitution we can express $Y_t$ as

$$
Y_t = \beta_{o2}\sum_{j=0}^{\infty} \beta_{o1}^j W_{t-j} + \sum_{j=0}^{\infty} \beta_{o1}^j \varepsilon_{t-j}.
$$

By Proposition 3.52 it follows that $E(Y_{t-1}\varepsilon_t) = 0$ given (iii'.a) and (iii'.c).
Hence, $E(X_t \varepsilon_t) = (E(Y_{t-1}\varepsilon_t), E(W_t \varepsilon_t))' = 0$ so that (iii.a) is satisfied.
Turning to condition (iii.b) we have $E|W_t \varepsilon_t|^r \le \Delta'^{\,r} E|\varepsilon_t|^r < \infty$
given (iii'.b), where $|W_t| < \Delta' < \infty$. Also

$$
E|Y_{t-1}\varepsilon_t|^r \le (E|Y_{t-1}|^{2r})^{1/2}(E|\varepsilon_t|^{2r})^{1/2}
$$

by the Cauchy-Schwarz inequality. Since $E|\varepsilon_t|^{2r} < \Delta < \infty$ given (iii'.b), it
remains to be shown that $E|Y_t|^{2r} < \infty$. Applying Minkowski's inequality
(see Exercise 3.53),

$$
\begin{aligned}
E|Y_t|^{2r} &= E\Big|\beta_{o2}\sum_{j=0}^{\infty} \beta_{o1}^j W_{t-j} + \sum_{j=0}^{\infty} \beta_{o1}^j \varepsilon_{t-j}\Big|^{2r} \\
&\le \Big[|\beta_{o2}|\sum_{j=0}^{\infty} |\beta_{o1}|^j |W_{t-j}| + \sum_{j=0}^{\infty} |\beta_{o1}|^j \big(E|\varepsilon_{t-j}|^{2r}\big)^{1/(2r)}\Big]^{2r} \\
&\le \Big[\big(|\beta_{o2}|\Delta' + \Delta^{1/(2r)}\big)\sum_{j=0}^{\infty} |\beta_{o1}|^j\Big]^{2r} < \infty
\end{aligned}
$$

if and only if $|\beta_{o1}| < 1$. Therefore, $E|Y_{t-1}\varepsilon_t|^r < \infty$ given (i') and (iv'.a)
so that condition (iii.b) is satisfied. Next, condition 5.21 (iii.c) is imposed
by 5.22 (iii'.c). It remains to verify condition 5.21 (iv.a). Now $E|W_t^2|^r =
|W_t|^{2r} < \Delta'^{\,2r}$ given (iv.a) and $E|Y_t^2|^r < \infty$ as shown previously. Hence,
all conditions of Exercise 5.21 are satisfied so that the OLS estimate of
$(\beta_{o1}, \beta_{o2})$ is consistent and asymptotically normal. ■


Exercise 5.27 
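Proof. First, we apply Corollary 5.26 and Proposition 5.1 to show that
$V_n^{-1/2} n^{-1/2} Z'\varepsilon \overset{A}{\sim} N(0, I)$.

Given (iii'.a), $\{Z_t \varepsilon_t\}$ is a martingale difference sequence with $\mathrm{var}(n^{-1/2} Z'\varepsilon)
= V_n$ finite by (iii'.b) with $\det(V_n) > \delta > 0$ for all $n$ sufficiently large given
(iii'.c). Hence, consider

$$
\lambda' V_n^{-1/2} n^{-1/2} Z'\varepsilon = n^{-1/2}\sum_{t=1}^n \lambda' V_n^{-1/2} Z_t \varepsilon_t.
$$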


By writing 


p k 
AV- Zier: = ` N Ain ZthiEth 
we see that the additivity of conditional expectations implies


k 
E(N V5? Ziet|Fe-1) = >> Ain E (Ztni€tn|Fe-1) = 0, 
h=1 i=1 
as $E(Z_{thi}\varepsilon_{th} \mid \mathcal{F}_{t-1}) = 0$ given (iii.a). Applying Minkowski's inequality
yields


r 


Ain ŽthiEth 


| 
es 

Y 

Me 


EIN V; P Ze = 


IA 


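given (iii'.b). Further,

$$
\bar{\sigma}_n^2 = \mathrm{var}(\lambda' V_n^{-1/2} n^{-1/2} Z'\varepsilon)
= \lambda' V_n^{-1/2}\, \mathrm{var}(n^{-1/2} Z'\varepsilon)\, V_n^{-1/2} \lambda = 1
$$

for all $n$ sufficiently large.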
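Next, consider

$$
n^{-1}\sum_{t=1}^n \lambda' V_n^{-1/2} Z_t \varepsilon_t \varepsilon_t' Z_t' V_n^{-1/2} \lambda.
$$

Since $\{Z_t \varepsilon_t \varepsilon_t' Z_t'\}$ is a mixing sequence with either $\phi$ of size $-r/2(r - 1)$,
$r > 2$, or $\alpha$ of size $-r/(r - 2)$, $r > 2$, by Theorem 3.49 it follows that
$n^{-1}\sum_{t=1}^n Z_t \varepsilon_t \varepsilon_t' Z_t' - V_n \xrightarrow{p} 0$ given (iii'.b). By Proposition 2.30,

$$
\begin{aligned}
& n^{-1}\sum_{t=1}^n \lambda' V_n^{-1/2} Z_t \varepsilon_t \varepsilon_t' Z_t' V_n^{-1/2} \lambda - \lambda' V_n^{-1/2} V_n V_n^{-1/2} \lambda \\
&\quad = n^{-1}\sum_{t=1}^n \lambda' V_n^{-1/2} Z_t \varepsilon_t \varepsilon_t' Z_t' V_n^{-1/2} \lambda - 1 \xrightarrow{p} 0.
\end{aligned}
$$

Hence, the sequence $\{\lambda' V_n^{-1/2} Z_t \varepsilon_t\}$ satisfies the conditions of Corollary
5.26, and it follows that $\lambda' V_n^{-1/2} n^{-1/2}\sum_{t=1}^n Z_t \varepsilon_t = \lambda' V_n^{-1/2} n^{-1/2} Z'\varepsilon \overset{A}{\sim}
N(0, 1)$. By Proposition 5.1, $V_n^{-1/2} n^{-1/2} Z'\varepsilon \overset{A}{\sim} N(0, I)$.

Now, given (ii'), (iv.a), and (iv.b), $Z'X/n - Q_n \xrightarrow{p} 0$ by Corollary 3.48
and Theorem 2.24. The remaining results follow as before. ■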


Exercise 6.2 
Proof. The following conditions are sufficient: 


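(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/2(r - 1)$,
$r > 2$, or $\alpha$ of size $-r/(r - 2)$, $r > 2$;
(iii) (a) $E(Z_{tgi}\varepsilon_{th} \mid \mathcal{F}_{t-1}) = 0$ for all $t$, where $\{\mathcal{F}_t\}$ is adapted to $\{Z_{tgi}\varepsilon_{th}\}$,
$g, h = 1, \ldots, p$, $i = 1, \ldots, l$;
(b) $E|Z_{tgi}\varepsilon_{th}|^{r+\delta} < \Delta < \infty$ and $E|\varepsilon_{th}|^{r+\delta} < \Delta < \infty$ for some $\delta > 0$,
$g, h = 1, \ldots, p$, $i = 1, \ldots, l$, and all $t$;
(c) $E(\varepsilon_t \varepsilon_t' \mid Z_t) = \sigma_o^2 I_p$, $t = 1, \ldots, n$;
(iv) (a) $E|Z_{thi}|^{r+\delta} < \Delta < \infty$ and $E|X_{thj}|^{r+\delta} < \Delta < \infty$ for some $\delta > 0$,
$h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$, and all $t$;
(b) $Q_n = E(Z'X/n)$ has full column rank uniformly in $n$ for all $n$
sufficiently large;
(c) $L_n = E(Z'Z/n)$ has $\det(L_n) > \delta > 0$ for all $n$ sufficiently large.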


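Given conditions (i)-(iv), the asymptotically efficient estimator is

$$
\tilde{\beta}_n = \tilde{\beta}_{2SLS} = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'Y
$$

by Exercise 4.47. First, consider $Z'Z/n$. Now $\{Z_t Z_t'\}$ is a mixing sequence
with the same size as $\{(Z_t', X_t', \varepsilon_t)\}$ by Proposition 3.50. Hence, by Corollary
3.48, $Z'Z/n - L_n = n^{-1}\sum_{t=1}^n Z_t Z_t' - n^{-1}\sum_{t=1}^n E(Z_t Z_t') \xrightarrow{a.s.} 0$ given
(iv.a) and $Z'Z/n - L_n \xrightarrow{p} 0$ by Theorem 2.24.

Next consider

$$
\begin{aligned}
\hat{\sigma}_n^2 &= (np)^{-1}(Y - X\tilde{\beta}_n)'(Y - X\tilde{\beta}_n) \\
&= (\varepsilon - X(\tilde{\beta}_n - \beta_o))'(\varepsilon - X(\tilde{\beta}_n - \beta_o))/(np) \\
&= \varepsilon'\varepsilon/(np) - 2(\tilde{\beta}_n - \beta_o)'X'\varepsilon/(np)
 + (\tilde{\beta}_n - \beta_o)'(X'X/n)(\tilde{\beta}_n - \beta_o)/p.
\end{aligned}
$$

As the conditions of Exercise 3.79 are satisfied, it follows that $\tilde{\beta}_n - \beta_o \xrightarrow{a.s.} 0$.
Also, $X'\varepsilon/n = O_{a.s.}(1)$ by Corollary 3.48 given (ii), (iii.b), and (iv.a).
Hence $(\tilde{\beta}_n - \beta_o)'X'\varepsilon/n \xrightarrow{p} 0$ by Exercise 2.22 and Theorem 2.24. Similarly,
$\{X_t X_t'\}$ is a mixing sequence with size given in (ii) with elements satisfying
the moment condition of Corollary 3.48 given (iv.a), so that $X'X/n =
O_{a.s.}(1)$ and therefore $(\tilde{\beta}_n - \beta_o)'(X'X/n)(\tilde{\beta}_n - \beta_o) \xrightarrow{p} 0$. Finally, consider

$$
\varepsilon'\varepsilon/(np) = p^{-1}\sum_{h=1}^p n^{-1}\sum_{t=1}^n \varepsilon_{th}^2.
$$

Now for any $h = 1, \ldots, p$, $\{\varepsilon_{th}^2\}$ is a mixing sequence with $\phi$ of size $-r/2(r - 1)$,
$r > 2$, or $\alpha$ of size $-r/(r - 2)$, $r > 2$. As $\{\varepsilon_{th}^2\}$ satisfies the moment
condition of Corollary 3.48 given (iii.b) and $E(\varepsilon_{th}^2) = \sigma_o^2$ given (iii.c), it
follows that

$$
n^{-1}\sum_{t=1}^n \varepsilon_{th}^2 - n^{-1}\sum_{t=1}^n E(\varepsilon_{th}^2) = n^{-1}\sum_{t=1}^n \varepsilon_{th}^2 - \sigma_o^2 \xrightarrow{p} 0.
$$

Hence, $\varepsilon'\varepsilon/(np) \xrightarrow{p} \sigma_o^2$, and it follows that $\hat{\sigma}_n^2 \xrightarrow{p} \sigma_o^2$ by Exercise 2.35. ■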


Exercise 6.6 
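Proof. The proof is analogous to that of Theorem 6.3. Consider for simplicity
the case $p = 1$. We decompose $\hat{V}_n - V_n$ as follows:

$$
\begin{aligned}
\hat{V}_n - V_n &= n^{-1}\sum_{t=1}^n \varepsilon_t^2 Z_t Z_t' - n^{-1}\sum_{t=1}^n E(\varepsilon_t^2 Z_t Z_t') \\
&\quad - 2 n^{-1}\sum_{t=1}^n (\tilde{\beta}_n - \beta_o)' X_t \varepsilon_t Z_t Z_t' \\
&\quad + n^{-1}\sum_{t=1}^n (\tilde{\beta}_n - \beta_o)' X_t X_t' (\tilde{\beta}_n - \beta_o) Z_t Z_t'.
\end{aligned}
$$

Now $\{\varepsilon_t^2 Z_t Z_t'\}$ is a mixing sequence with either $\phi$ of size $-r/(2r - 1)$,
$r > 1$, or $\alpha$ of size $-r/(r - 1)$, $r > 1$, given (ii) with elements satisfying
the moment condition of Corollary 3.48 given (iii.b). Hence

$$
n^{-1}\sum_{t=1}^n \varepsilon_t^2 Z_t Z_t' - n^{-1}\sum_{t=1}^n E(\varepsilon_t^2 Z_t Z_t') \xrightarrow{p} 0.
$$

The remaining terms converge to zero in probability as in Theorem 6.3,
where we now use results on mixing sequences in place of results on stationary,
ergodic sequences. For example, by the Cauchy-Schwarz inequality,

$$
E|X_{t\kappa} Z_{ti} Z_{tj} \varepsilon_t|^{r+\delta} \le \big(E|X_{t\kappa} Z_{ti}|^{2(r+\delta)}\big)^{1/2}\big(E|Z_{tj}\varepsilon_t|^{2(r+\delta)}\big)^{1/2} \le \Delta < \infty
$$

given (iii.b) and (iv.a). As $\{X_{t\kappa} Z_{ti} Z_{tj} \varepsilon_t\}$ is a mixing sequence with size
given in (ii) and it satisfies the moment condition of Corollary 3.48, it
follows that

$$
n^{-1}\sum_{t=1}^n X_{t\kappa} Z_{ti} Z_{tj} \varepsilon_t - n^{-1}\sum_{t=1}^n E(X_{t\kappa} Z_{ti} Z_{tj} \varepsilon_t) \xrightarrow{p} 0.
$$

Since $\tilde{\beta}_n - \beta_o \xrightarrow{p} 0$ under the conditions given, we have

$$
n^{-1}\sum_{t=1}^n (\tilde{\beta}_n - \beta_o)' X_t \varepsilon_t Z_t Z_t' \xrightarrow{p} 0
$$

by Exercise 2.35. Finally, consider the third term. The Cauchy-Schwarz
inequality gives $E|X_{t\kappa} X_{t\lambda} Z_{ti} Z_{tj}|^{r+\delta} < \infty$ so that

$$
n^{-1}\sum_{t=1}^n X_{t\kappa} X_{t\lambda} Z_{ti} Z_{tj} - n^{-1}\sum_{t=1}^n E(X_{t\kappa} X_{t\lambda} Z_{ti} Z_{tj}) \xrightarrow{p} 0
$$

by Corollary 3.48. Thus the third term vanishes in probability and application
of Exercise 2.35 yields $\hat{V}_n - V_n \xrightarrow{p} 0$. ■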


Exercise 6.7 
Proof. The proof is immediate from Exercise 5.27 and Exercise 6.6. ■


Exercise 6.8
Proof. Conditions (i)-(iv) ensure that Exercise 6.6 holds for $\tilde{\beta}_n$ and
$\hat{V}_n - V_n \xrightarrow{p} 0$. Next set $\hat{P}_n = \hat{V}_n^{-1}$ in Exercise 6.7. Then $P_n = V_n^{-1}$ and the
result follows. ■


Exercise 6.12 
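Proof. Exactly as in the proof of Theorem 6.9, all the terms

$$
\begin{aligned}
& (n - \tau)^{-1}\sum_{t=\tau+1}^n \big[Z_t \varepsilon_t \varepsilon_{t-\tau}' Z_{t-\tau}' - E(Z_t \varepsilon_t \varepsilon_{t-\tau}' Z_{t-\tau}')\big] \\
& - (n - \tau)^{-1}\sum_{t=\tau+1}^n Z_t X_t'(\tilde{\beta}_n - \beta_o)\varepsilon_{t-\tau}' Z_{t-\tau}' \\
& - (n - \tau)^{-1}\sum_{t=\tau+1}^n Z_t \varepsilon_t (\tilde{\beta}_n - \beta_o)' X_{t-\tau} Z_{t-\tau}' \\
& + (n - \tau)^{-1}\sum_{t=\tau+1}^n Z_t X_t'(\tilde{\beta}_n - \beta_o)(\tilde{\beta}_n - \beta_o)' X_{t-\tau} Z_{t-\tau}'
\end{aligned}
$$

converge to zero in probability. Note that Theorem 3.49 is invoked to guarantee
the summands are mixing sequences with size given in (ii), and the
Cauchy-Schwarz inequality is used to verify the moment condition of Corollary
3.48. For example, given (ii) $\{Z_t \varepsilon_t \varepsilon_{t-\tau}' Z_{t-\tau}'\}$ is a mixing sequence with
either $\phi$ of size $-r/(2r - 2)$, $r > 2$, or $\alpha$ of size $-r/(r - 2)$, $r > 2$, with
elements satisfying the moment condition of Corollary 3.48 given (iii.b).
Hence

$$
(n - \tau)^{-1}\sum_{t=\tau+1}^n Z_t \varepsilon_t \varepsilon_{t-\tau}' Z_{t-\tau}' - (n - \tau)^{-1}\sum_{t=\tau+1}^n E(Z_t \varepsilon_t \varepsilon_{t-\tau}' Z_{t-\tau}') \xrightarrow{p} 0.
$$

The remaining terms can be shown to converge to zero in probability in a
manner similar to Theorem 6.3. ■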


Exercise 6.13 
Proof. Immediate from Theorem 5.22 and Exercise 6.12. ■


Exercise 6.14
Proof. Conditions (i)-(iv) ensure that Exercise 6.12 holds for $\tilde{\beta}_n$ and
$\hat{V}_n - V_n \xrightarrow{p} 0$. Next set $\hat{P}_n = \hat{V}_n^{-1}$ in Exercise 6.13. Then $P_n = V_n^{-1}$ and the
result follows. ■


Exercise 6.15 
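Proof. Under the conditions of Theorem 6.9 or Exercise 6.12 it holds that
$\hat{V}_n - V_n \xrightarrow{p} 0$, so the result will follow if also $\tilde{V}_n - \hat{V}_n \xrightarrow{p} 0$, where $\tilde{V}_n$
denotes the weighted estimator. Now

$$
\tilde{V}_n - \hat{V}_n = \sum_{\tau=1}^m (w_{n\tau} - 1)\, n^{-1}\sum_{t=\tau+1}^n \big[Z_t \hat{\varepsilon}_t \hat{\varepsilon}_{t-\tau}' Z_{t-\tau}' + Z_{t-\tau}\hat{\varepsilon}_{t-\tau}\hat{\varepsilon}_t' Z_t'\big].
$$

Since for each $\tau = 1, \ldots, m$ we have

$$
n^{-1}\sum_{t=\tau+1}^n \big[Z_t \hat{\varepsilon}_t \hat{\varepsilon}_{t-\tau}' Z_{t-\tau}' + Z_{t-\tau}\hat{\varepsilon}_{t-\tau}\hat{\varepsilon}_t' Z_t'\big] = O_p(1)
$$

it follows that

$$
\tilde{V}_n - \hat{V}_n = \sum_{\tau=1}^m (w_{n\tau} - 1)O_p(1) = \sum_{\tau=1}^m o(1)O_p(1) = o_p(1),
$$

where we used that $w_{n\tau} \to 1$. ■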
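
To make the role of the weights concrete, the sketch below computes a weighted and an unweighted version of the estimator in the scalar case (one instrument, p = 1) and confirms that their difference is exactly the sum of (1 - w) times the lag terms, as in the decomposition above. The Bartlett form of the weights, w = 1 - τ/(m + 1), is an assumption chosen for illustration; the exercise only requires that each weight tends to one. The score series is hypothetical.

```python
def cross_term(scores, tau):
    """n^{-1} * sum over t > tau of 2 * s_t * s_{t-tau}, with s_t = z_t * e_t (scalar case)."""
    n = len(scores)
    return sum(2.0 * scores[t] * scores[t - tau] for t in range(tau, n)) / n

def weighted_variance(scores, m, weight):
    """Lag-0 term plus weighted lag terms up to truncation lag m."""
    n = len(scores)
    total = sum(s * s for s in scores) / n
    for tau in range(1, m + 1):
        total += weight(tau, m) * cross_term(scores, tau)
    return total

def unit_weight(tau, m):
    """Unweighted case: every lag receives weight one."""
    return 1.0

def bartlett_weight(tau, m):
    """Assumed illustrative weights, declining linearly in the lag."""
    return 1.0 - tau / (m + 1.0)

scores = [0.5, -1.0, 0.25, 0.75, -0.5, 1.5, -0.25, 0.0]   # hypothetical z_t * e_t values
m = 2
v_unweighted = weighted_variance(scores, m, unit_weight)
v_weighted = weighted_variance(scores, m, bartlett_weight)

# Difference equals sum over lags of (1 - w) * cross_term, term by term.
gap = v_unweighted - v_weighted
expected_gap = sum((1.0 - bartlett_weight(tau, m)) * cross_term(scores, tau)
                   for tau in range(1, m + 1))
```

With these weights and a fixed lag, letting the truncation lag grow drives the weight toward one (for instance `bartlett_weight(1, 999)` is 0.999), so each term of the gap is a vanishing factor times a bounded quantity.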
Exercise 7.2
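Proof. Since $X_t = X_0 + \sum_{s=1}^t Z_s$, we have

$$
E(X_t) = E\Big(X_0 + \sum_{s=1}^t Z_s\Big) = 0 + \sum_{s=1}^t E(Z_s) = 0,
$$

and

$$
\mathrm{var}(X_t) = \mathrm{var}\Big(X_0 + \sum_{s=1}^t Z_s\Big) = 0 + \sum_{s=1}^t \mathrm{var}(Z_s) = t\sigma^2,
$$

using the independence between $Z_r$ and $Z_s$ for $r \ne s$. ■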
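
The two moments can be confirmed by exact enumeration for a simple case. The sketch below assumes, purely for illustration, that X_0 = 0 and that each increment Z_s equals +1 or -1 with probability one half, so that σ² = 1; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def random_walk_moments(t):
    """Exact mean and variance of X_t = Z_1 + ... + Z_t for equally likely steps of +1 or -1."""
    paths = list(product((-1, 1), repeat=t))   # all 2**t equally likely increment sequences
    prob = Fraction(1, len(paths))
    mean = sum(prob * sum(path) for path in paths)
    var = sum(prob * (sum(path) - mean) ** 2 for path in paths)
    return mean, var

moments = {t: random_walk_moments(t) for t in range(1, 9)}
```

For every horizon in `moments` the mean is zero and the variance equals t, which is t times the unit increment variance.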


Exercise 7.3 
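Proof. Note that

$$
\begin{aligned}
X_{t_4} - X_{t_3} &= Z_{t_4} + Z_{t_4 - 1} + \cdots + Z_{t_3 + 1}, \\
X_{t_2} - X_{t_1} &= Z_{t_2} + Z_{t_2 - 1} + \cdots + Z_{t_1 + 1}.
\end{aligned}
$$

Since $(Z_{t_1+1}, \ldots, Z_{t_2})$ is independent of $(Z_{t_3+1}, \ldots, Z_{t_4})$ it follows from
Proposition 3.2 (ii) that $X_{t_4} - X_{t_3}$ and $X_{t_2} - X_{t_1}$ are independent. ■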


Exercise 7.4 
Proof. By definition

$$
\begin{aligned}
W_n(b) - W_n(a) &= n^{-1/2}\sum_{t=[na]+1}^{[nb]} Z_t \\
&= n^{-1/2}([nb] - [na])^{1/2} \times ([nb] - [na])^{-1/2}\sum_{t=[na]+1}^{[nb]} Z_t.
\end{aligned}
$$

The last term $([nb] - [na])^{-1/2}\sum_{t=[na]+1}^{[nb]} Z_t \xrightarrow{d} N(0, 1)$ by the central
limit theorem, and $n^{-1/2}([nb] - [na])^{1/2} = (([nb] - [na])/n)^{1/2} \to (b - a)^{1/2}$
as $n \to \infty$. Hence $W_n(b) - W_n(a) \xrightarrow{d} N(0, b - a)$. ■
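
The scaling factor in this argument is easy to tabulate. As a small sketch, assuming i.i.d. increments with unit variance, the exact variance of W_n(b) - W_n(a) is ([nb] - [na])/n, and it differs from b - a by less than 1/n. The function name and the chosen endpoints are illustrative.

```python
import math

def increment_variance(n, a, b):
    """Exact variance of W_n(b) - W_n(a) for i.i.d. unit-variance increments."""
    return (math.floor(n * b) - math.floor(n * a)) / n

a, b = 0.25, 0.75    # endpoints chosen so that n * a and n * b are exact in floating point
table = {n: increment_variance(n, a, b) for n in (7, 10, 101, 1000, 10 ** 6)}
errors = {n: abs(v - (b - a)) for n, v in table.items()}
```

For example, with n = 10 the count of increments is 7 - 2 = 5, giving variance 0.5, while with n = 7 it is 5 - 1 = 4, giving 4/7; the discrepancy shrinks at rate 1/n.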
Exercise 7.22

Proof. (i) Let $U \in C$, and let $\{U_n\}$ be a sequence of mappings in $D$ such that
$U_n \to U$, i.e., $b_n = d_u(U_n, U) = \sup_{a \in [0,1]} |U_n(a) - U(a)| \to 0$ as $n \to \infty$.
We have

$$
\begin{aligned}
d_u(U_n^2, U^2) &= \sup_{a \in [0,1]} |U_n(a)^2 - U(a)^2| \\
&= \sup_{a \in [0,1]} |U_n(a) - U(a)|\,|U_n(a) + U(a)| \\
&\le b_n \sup_{a \in [0,1]} |U_n(a) + U(a)|,
\end{aligned}
$$

where $b_n = d_u(U_n, U)$. The boundedness of $U$ and the fact that $b_n \to 0$
imply that for all $n$ sufficiently large, $U_n$ is bounded on $[0, 1]$. Hence,
$\sup_{a \in [0,1]} |U_n(a) + U(a)| = O(1)$ and it follows that $d_u(U_n^2, U^2) \to 0$, which
proves that $M_1$ is continuous at $U$.

Now consider the function on $[0, 1]$ given by

$$
V(a) = \begin{cases} \mathrm{int}\big(\tfrac{1}{1-a}\big), & \text{for } 0 \le a < 1, \\ 0, & \text{for } a = 1. \end{cases}
$$

For $0 \le a < 1/2$, this function is 1, then jumps to 2 for $a = 1/2$, to 3 for
$a = 2/3$, and so forth. Clearly $V(a)$ is continuous from the right and has
left limits everywhere, so $V \in D = D[0, 1]$.
Next, define the sequence of functions $V_n(a) = V(a) + \epsilon/n$, for some
$\epsilon > 0$. Then $V_n \in D$ and $V_n \to V$.

Since

$$
\begin{aligned}
d_u(V_n^2, V^2) &= \sup_{a \in [0,1]} |V_n(a)^2 - V(a)^2| \\
&= \sup_{a \in [0,1]} |(V_n(a) - V(a))(V_n(a) + V(a))| \\
&= (\epsilon/n) \sup_{a \in [0,1]} |V_n(a) + V(a)|
\end{aligned}
$$

is infinite for all $n$, we conclude that $V_n^2 \not\to V^2$. Hence, $M_1$ is not continuous
everywhere in $D$.

(ii) Let $U \in C$; then $U$ is bounded and $\int_0^1 |U(a)|\,da < \infty$. Let $\{U_n\}$ be a
sequence of functions in $D$ such that $U_n \to U$. Define $b_n = d_u(U_n, U)$; then
$b_n \to 0$ and $\int_0^1 |U_n(a)|\,da \le \int_0^1 |U(a)|\,da + b_n$. Since

$$
\begin{aligned}
\Big|\int_0^1 U_n(a)^2\,da - \int_0^1 U(a)^2\,da\Big|
&\le \int_0^1 |U_n(a)^2 - U(a)^2|\,da \\
&= \int_0^1 |U_n(a) - U(a)|\,|U_n(a) + U(a)|\,da \\
&\le b_n \int_0^1 |U_n(a) + U(a)|\,da \\
&\le b_n\Big(b_n + 2\int_0^1 |U(a)|\,da\Big) \to 0,
\end{aligned}
$$

$M_2$ is continuous at $U \in C$.

(iii) Let $V(a) = a$ for all $a \in [0, 1]$ and let $V_n(a) = 0$ for $0 \le a <
1/n$ and $V_n(a) = V(a)$ for $a \ge 1/n$. Thus $d_u(V, V_n) = 1/n$ and $V_n \to V$;
however, $\int_0^1 \log(|V_n(a)|)\,da \not\to \int_0^1 \log(|V(a)|)\,da$, since $\int_0^1 \log(|V(a)|)\,da = -1$
whereas $\int_0^1 \log(|V_n(a)|)\,da = -\infty$ for all $n$. Hence, it follows that $M_3$ is not
continuous. ■
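
Two numerical facts used in these counterexamples can be illustrated with a short sketch in plain Python: the gap between V_n² and V² in part (i) grows without bound as a approaches one, and the integral of log(a) over [0, 1] in part (iii) equals -1. The grid points, the value ε = 1, and the midpoint rule are illustrative choices, not part of the proof.

```python
import math

def V(a):
    """Step function from part (i): int(1 / (1 - a)) on [0, 1) and 0 at a = 1."""
    return 0 if a == 1 else int(1 / (1 - a))

def sup_gap_on_grid(n, eps, grid):
    """Largest |V_n(a)^2 - V(a)^2| over a finite grid, where V_n = V + eps / n."""
    return max(abs((V(a) + eps / n) ** 2 - V(a) ** 2) for a in grid)

def midpoint_integral_log(cells):
    """Midpoint-rule approximation to the integral of log(a) over [0, 1]."""
    return sum(math.log((i + 0.5) / cells) for i in range(cells)) / cells

# Grids that approach a = 1 more and more closely.
coarse = [1 - 10.0 ** (-k) for k in range(1, 4)]
fine = [1 - 10.0 ** (-k) for k in range(1, 9)]
gap_coarse = sup_gap_on_grid(10, 1.0, coarse)
gap_fine = sup_gap_on_grid(10, 1.0, fine)

log_integral = midpoint_integral_log(10000)
```

Refining the grid toward a = 1 makes the gap arbitrarily large for fixed n, consistent with the supremum being infinite, while `log_integral` is close to -1.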


Exercise 7.23 
Proof. If $\{\varepsilon_t\}$ satisfies the conditions of the heterogeneous mixing CLT and
is globally covariance stationary, then (ii) of Theorem 7.21 follows directly
from Theorem 7.18. Since $\{\varepsilon_t\}$ is assumed to be mixing, then $\{\varepsilon_t^2\}$ is also
mixing of the same size (see Proposition 3.50). By Corollary 3.48 it also then
follows that $n^{-1}\sum_{t=1}^n \varepsilon_t^2 - n^{-1}\sum_{t=1}^n E(\varepsilon_t^2) = o_p(1)$. So if $n^{-1}\sum_{t=1}^n E(\varepsilon_t^2)$
converges as $n \to \infty$, then (iii) of Theorem 7.21 holds.

For $\sigma^2 = \tau^2$ it is required that $\lim n^{-1}\sum_{t=2}^n \sum_{s=1}^{t-1} E(\varepsilon_{t-s}\varepsilon_t) = 0$. It
suffices that $\{\varepsilon_t\}$ is independent or that $\{\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference
sequence. ■


Exercise 7.28 

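Proof. From Donsker's theorem and the continuous mapping theorem
we have that $n^{-2}\sum_{t=1}^n X_t^2 \Rightarrow \sigma_1^2 \int_0^1 W_1(a)^2\,da$. The multivariate version
of Donsker's theorem states that

$$
(n^{-1/2} X_{[na]}, n^{-1/2} Y_{[na]}) \Rightarrow (\sigma_1 W_1(a), \sigma_2 W_2(a)).
$$

Applying the continuous mapping theorem to the mapping

$$
(x, y) \mapsto \int_0^1 x(a) y(a)\,da
$$

we have that

$$
\begin{aligned}
n^{-2}\sum_{t=1}^n X_t Y_t &= n^{-1}\sum_{t=1}^n \sigma_1 W_{1n}(a_{t-1}) \sigma_2 W_{2n}(a_{t-1}) \\
&= \sigma_1 \sigma_2 \int_0^1 W_{1n}(a) W_{2n}(a)\,da \\
&\Rightarrow \sigma_1 \sigma_2 \int_0^1 W_1(a) W_2(a)\,da.
\end{aligned}
$$

Hence

$$
\begin{aligned}
\hat{\beta}_n &= \Big(n^{-2}\sum_{t=1}^n X_t^2\Big)^{-1} n^{-2}\sum_{t=1}^n X_t Y_t \\
&\Rightarrow \Big(\sigma_1^2 \int_0^1 W_1(a)^2\,da\Big)^{-1} \sigma_1 \sigma_2 \int_0^1 W_1(a) W_2(a)\,da \\
&= (\sigma_2/\sigma_1)\Big(\int_0^1 W_1(a)^2\,da\Big)^{-1} \int_0^1 W_1(a) W_2(a)\,da.
\end{aligned}
$$
■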


Exercise 7.31 
Proof. Since $\{\eta_t, \varepsilon_t\}$ satisfies the conditions of Theorem 7.30, it holds that

$$
\begin{aligned}
(n^{-1/2} X_{[na]}, n^{-1/2} Y_{[na]}) &= (\sigma_1 W_{1n}(a_{t-1}), \sigma_2 W_{2n}(a_{t-1})) \\
&\Rightarrow (\sigma_1 W_1(a), \sigma_2 W_2(a)),
\end{aligned}
$$

where $\sigma_1^2$ and $\sigma_2^2$ are the diagonal elements of $\Sigma$ (the off-diagonal elements
are zero, due to the independence of $\eta_t$ and $\varepsilon_t$). Applying the continuous
mapping theorem, we find $\hat{\beta}_n$ to have the same limiting distribution as we
found in Exercise 7.28. ■


Exercise 7.43 


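Proof. (i) If $\{\eta_t\}$ and $\{\varepsilon_t\}$ are independent, then $E(\eta_s \varepsilon_t') = 0$ for all $s$
and $t$. So

$$
\Lambda_n = \Sigma_1^{-1/2}\, n^{-1}\sum_{t=2}^n \sum_{s=1}^{t-1} E(\eta_s \varepsilon_t')\, \Sigma_2^{-1/2} = 0,
$$

and clearly $\Lambda_n \to \Lambda = 0$.

(ii) If $\eta_t = \varepsilon_{t-1}$, then

$$
\begin{aligned}
\Lambda_n &= \Sigma_1^{-1/2}\, n^{-1}\sum_{t=2}^n \sum_{s=1}^{t-1} E(\eta_s \varepsilon_t')\, \Sigma_2^{-1/2} \\
&= \Sigma_1^{-1/2}\, n^{-1}\sum_{t=2}^n \sum_{s=1}^{t-1} E(\varepsilon_{s-1} \varepsilon_t')\, \Sigma_2^{-1/2}.
\end{aligned}
$$

If the sequence $\{\varepsilon_t\}$ is covariance stationary such that $\rho_s = E(\varepsilon_{t-s}\varepsilon_t')$ is
well defined, then

$$
\begin{aligned}
n^{-1}\sum_{t=2}^n \sum_{s=1}^{t-1} E(\varepsilon_{s-1}\varepsilon_t')
&= n^{-1}\sum_{t=2}^n \sum_{s=2}^{t} E(\varepsilon_{t-s}\varepsilon_t') \\
&= n^{-1}\sum_{t=2}^n \sum_{s=2}^{t} \rho_s \\
&= \sum_{s=2}^n \frac{n + 1 - s}{n}\,\rho_s,
\end{aligned}
$$

so that

$$
\Lambda_n = \Sigma_1^{-1/2}\Big(\sum_{s=2}^n \frac{n + 1 - s}{n}\,\rho_s\Big)\Sigma_2^{-1/2},
$$

where in this situation we have

$$
\Sigma_1 = \Sigma_2 = \lim_{n \to \infty} n^{-1}\sum_{t=1}^n \sum_{s=1}^n \rho_{t-s}
= \rho_0 + \lim_{n \to \infty} \sum_{s=1}^{n-1} \frac{n - s}{n}(\rho_s + \rho_s').
$$

Notice that when $\{\varepsilon_t\}$ is a sequence of independent variables such that
$\rho_s = 0$ for $s \ge 1$, then $\mathrm{var}(\varepsilon_t) = \rho_0 = \Sigma_1$, and $\Lambda = 0$. This remains true
for the case where $\eta_t = \varepsilon_t$, which is assumed in the likelihood analysis of
cointegrated processes by Johansen (1988, 1991). ■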
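
The change of summation index in part (ii), which turns the double sum over (t, s) into a single weighted sum over lags, can be verified exactly for a scalar example. In the sketch below the autocovariance sequence ρ_h = 2^{-h} is a hypothetical choice and the function names are illustrative; exact rational arithmetic avoids any rounding question.

```python
from fractions import Fraction

def double_sum(rho, n):
    """n^{-1} * sum_{t=2}^{n} sum_{s=1}^{t-1} rho[t - s + 1].

    The lag between eps_{s-1} and eps_t is t - (s - 1) = t - s + 1.
    """
    total = sum(rho[t - s + 1] for t in range(2, n + 1) for s in range(1, t))
    return total / n

def weighted_sum(rho, n):
    """sum_{s=2}^{n} ((n + 1 - s) / n) * rho[s]."""
    return sum(Fraction(n + 1 - s, n) * rho[s] for s in range(2, n + 1))

n = 12
rho = [Fraction(1, 2 ** h) for h in range(n + 1)]   # hypothetical scalar autocovariances
lhs = double_sum(rho, n)
rhs = weighted_sum(rho, n)
sums_match = lhs == rhs
```

For this choice of ρ the weighted sum approaches the sum of ρ_s over s ≥ 2, which is 1/2, as n grows, so the limit of the double sum is nonzero when the increments are serially correlated in this way.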


Exercise 7.44 
Proof. 


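(a) This follows from Theorem 7.21 (a), or we have directly

$$
n^{-2}\sum_{t=1}^n X_t^2 = \sigma_1^2 \int_0^1 W_{1n}(a) W_{1n}(a)\,da \Rightarrow \sigma_1^2 \int_0^1 W_1(a)^2\,da.
$$

(b) Similarly,

$$
n^{-1}\sum_{t=1}^n X_t \varepsilon_t = \sigma_1 \int_0^1 W_{1n}(a)\,dW_{2n}(a)\,\sigma_2 \Rightarrow \sigma_1 \sigma_2 \int_0^1 W_1(a)\,dW_2(a),
$$

by Theorem 7.42 and using that $\Lambda = 0$ since $\{\varepsilon_t\}$ and $\{\eta_t\}$ are independent
(see Exercise 7.43).

(c)

$$
\begin{aligned}
n(\hat{\beta}_n - \beta_o) &= \Big(n^{-2}\sum_{t=1}^n X_t^2\Big)^{-1} n^{-1}\sum_{t=1}^n X_t \varepsilon_t \\
&\Rightarrow (\sigma_2/\sigma_1)\Big(\int_0^1 W_1(a)^2\,da\Big)^{-1} \int_0^1 W_1(a)\,dW_2(a).
\end{aligned}
$$

(d) From (c) we have that $n(\hat{\beta}_n - \beta_o) = O_p(1)$, and by Exercise 2.35
we have that $(\hat{\beta}_n - \beta_o) = n^{-1} n(\hat{\beta}_n - \beta_o) = o(1)O_p(1) = o_p(1)$. So
$\hat{\beta}_n \xrightarrow{p} \beta_o$. ■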

References 


Dhrymes, P. (1980). Econometrics. Springer-Verlag, New York. 


Granger, C. W. J. and P. Newbold (1977). Forecasting Economic Time Series. 
Academic Press, New York. 


Johansen, S. (1988). "Statistical analysis of cointegration vectors." Journal of
Economic Dynamics and Control, 12, 231-254.


Johansen, S. (1991). "Estimation and hypothesis testing of cointegration vectors in
Gaussian vector autoregressive models." Econometrica, 59, 1551-80.


Laha, R. G. and V. K. Rohatgi (1979). Probability Theory. Wiley, New York. 


Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New 
York. 


Index 


Adapted mixingale, 124 
Adapted stochastic sequence, 58 
Approximatable stochastic function, 194 
AR(1), 48, 49 
ARCH(q), 102
ARMA, 49 
Asymptotic covariance matrix, 70, 73 
estimator, 137 
heteroskedasticity consistent, 139-146
heteroskedasticity/autocorrelation consistent, 154-164
heteroskedasticity/moving average consistent, 147-154
Newey-West estimator, 163 
Asymptotic distribution, 66 
Asymptotic efficiency, 83 
Bates and White framework, 93 
IV estimator, 84 
two-stage least squares, 86 
Asymptotic equivalence, 67 
Asymptotic normality, 71 
of β̂_n, 71
of β̃_n, 73
Asymptotically uncorrelated process, 52, 
139, 154 
avar, 70, 83 




Backshift operator, 42 
Best linear unbiased estimator, 3 
Borel 

sets, 41 

σ-field, 40
Cramér-Wold device, 114
functional, 188 
Fractionally integrated process, 208
Functional central limit theorem, 175 
Donsker, 176 
heterogeneous mixing, 177 
multivariate, 189 
Liapounov, 177 
Lindeberg-Feller, 177 
martingale difference, 178 
multivariate, 189 
stationary ergodic, 177 


GARCH(1,1), 102 
Gaussian AR(1), 48 
Generalized least squares, 5
estimator, 5
efficiency, 101
feasible estimator, 102
efficient, 105
Generalized method of moments, 85


Heteroskedasticity, 3, 38, 143
ARCH, 102
conditional, 130
GARCH, 102
unconditional, 130
Hölder inequality, 33
Hypothesis testing, 74-83


Implication rule, 25
Inequalities
Cauchy-Schwarz, 33
Chebyshev, 30
Cr, 30
Hölder, 33
Jensen, 29
conditional, 56
Markov, 30
Minkowski, 36
Instrumental variables, 8
estimator, 9
asymptotic normality, 74, 143, 145, 150, 152
consistency, 21, 23, 26, 27, 34, 38, 46, 51, 61
constrained, 86
definition, 9
efficient, 109
Integrated process, 179
fractionally, 208
Invariance principle, 176
Itô stochastic integral, 194
for random step functions, 193


Jensen inequality, 29
conditional, 56


Lagged dependent variable, 7
Lagrange multiplier test, 77
Law of iterated expectations, 57
Law of large numbers
martingale difference, 60
Laws of large numbers, 31
ergodic theorem, 44
i.i.d., 32
i.n.i.d., 35
mixing, 49
Lévy device, 58
Likelihood analysis, 210
cointegrated processes, 258
Likelihood ratio test, 80
Limit, 15
almost sure, 19
in mean square, 29
in probability, 24
Limited dependent variable, 209
Limiting distribution, 66
Lindeberg condition, 117


M-estimator, 210
Markov condition, 35
Markov inequality, 30
Martingale difference sequence, 53, 58
Maximum likelihood estimator, 210
Mean value theorem, 80
Measurable
function, 41
one-to-one transformation, 42
space, 39
Measure preserving transformation, 42
Measurement errors, 6
Method of moments estimator, 9
Metric, 173
space, 173
Metrized measurable space, 174
Metrized probability space, 174
Minkowski inequality, 36
Misspecification, 207
Mixing, 47
coefficients, 47
conditions, 46
size, 49
strong, (α-mixing), 47
uniform, (φ-mixing), 47
Mixingale, 125


Nonlinear model, 209


O(n^λ), 16
o(n^λ), 16
O_a.s.(n^λ), 23
o_a.s.(n^λ), 23
O_p(n^λ), 28
o_p(n^λ), 28

Ordinary least squares, 2 

estimator, 2 
definition, 72 
asymptotic normality, 72 
consistency, 20, 22, 26, 27, 33, 
38, 45, 50, 61 

definition, 2 


Panel data, 11 
Partial sum, 169 
Probability 
measure, 39 
space, 39 
Product rule, 28 


Quasi-maximum likelihood estimator, 79, 
210 


Random function, 169 
step function, 193 
Random walk, 167 
multivariate, 187 
Rcll function, 172


Serially correlated errors, 7 
Shift operator, 42 
σ-algebra, 39
σ-field, 39
Borel, 40
σ-fields
increasing sequence, 58 
Simultaneous systems, 11 
for panel data, 12 
Spurious regression, 185, 188 
Stationary process, 43 
covariance stationary, 52 
Stochastic function 
approximatable, 194 
square integrable, 194 
Stochastic integral, 192 
Strong mixing, 47 
Sum of squared residuals, 2 
Superconsistency, 180 
Test statistics
Lagrange multiplier, 77 
Likelihood ratio, 80 
Wald, 76, 81 
Three-stage least squares 
estimator, 111 
Two-stage instrumental variables 
estimator, 144, 163 
Two-stage instrumental variables estimator
asymptotic normality, 146, 151, 153, 
164 
Two-stage least squares, 10 
estimator, 144 
efficient, 111 


Uncorrelated process, 138, 139 
asymptotically, 139, 154 
Uniform asymptotic negligibility, 118 
Uniform continuity, 21 
theorem, 21 
Uniform mixing, 47 
Uniform nonsingularity, 22 
Uniform positive definiteness, 22 
Uniformly full column rank, 22 
Uniqueness theorem, 68 
Unit root, 179 
regression, 178 


Vec operator, 142 


Wald test, 76, 81 
Weak convergence, 171 
definition, 174 
Wiener measure, 176 
Wiener process, 170 
definition, 170 
multivariate, 186 
sample path, 171 


